Wednesday, October 25, 2017

Webpage Parsing and Reminding Myself About Things I Have Already Learned

Scaffold will now take a local file name or URL as input, verify which of those two input types it has been given, and open and read the page contents from a valid URL.
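In rough terms, that dispatch looks something like the sketch below. This is a simplified stand-in, not Scaffold's actual code; the `read_source` name and the scheme check are placeholders of my own.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

def read_source(source):
    """Return the raw contents of `source`, which may be a URL or a local file path."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Anything with an http(s) scheme is treated as a URL to fetch.
        with urlopen(source) as response:
            return response.read().decode("utf-8", errors="replace")
    # Otherwise, assume it's a path to a local plaintext file.
    with open(source, encoding="utf-8") as f:
        return f.read()
```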

The next challenge is for Scaffold to be able to identify and extract the body text from a webpage it has been given. NLTK used to have a clean_html function that would do this, but it appears to have since been deprecated. 

I am going to do a bit of research to see if there are any up-to-date and reliable python packages that will take care of this. If I can't find a free tool that has the functionality I need, I'm probably going to dive into the task myself and see how it goes. 
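For reference, even the standard library can get partway there. Here is a naive baseline using `html.parser` that keeps only text found inside paragraph tags; it's a sketch of the general shape of the problem, not a finished extractor, and real article pages will certainly need smarter filtering.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Naive body-text extractor: collect only the text inside <p> tags."""
    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)

def extract_body_text(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    return "\n\n".join(parser.paragraphs)
```

The obvious weakness is that navigation menus and footers sometimes live in `<p>` tags too, which is exactly the part a dedicated library (or a homegrown tool) would have to handle.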

In all honesty, I am really tempted to try to build my own tool from scratch to isolate the body text from a webpage; it sounds like a fun mini-project. I would, however, also like to remind myself that reinventing the wheel every time I run into an interesting problem is a great way to never release a finished version of this project. 

I basically ran through the same debate when determining how to handle the named-entity recognition and part-of-speech tagging tasks for Scaffold. I had designed a system of proper noun classification functions that would take phrase context into consideration and assign a weighted probability to each possible classification a token could have. I began coding and got through a decent portion of this numerical, probability-driven classification system. When it came to determining how I was going to train this system to provide reasonable probability values for each classification, it became clear that I was trying to solve a much more complex problem than I had realized.

A fork had appeared in my development path. On one path, I would continue to try building my original classification system and hope that it would reliably identify people/locations/etc. after enough training and hacky application of NLP principles. On the other path, I would put my original design aside and let NLTK take the NER weight off of my shoulders, at least for the version one release.

If it was not clear already from the bent of my storytelling, I chose the latter path. While it was hard to abandon an idea I was so excited about owning, I'm glad I did it for the sake of Scaffold surviving its natal state.

TL;DR I should try to find a tool to solve this webpage parsing problem before falling too far down the rabbit hole trying to build my own. Once the first Scaffold release is up and running, I can swap out all of the tools that do a pretty good job of what I need for more tailor-made, ambitious stuff. 


Monday, October 23, 2017

Switching Input Method

In the interests of testing, standardizing input, and improving ease of use, my next work item is to transition Scaffold from taking in plaintext files as input to taking the URLs of articles as input.

This was originally intended to be an improvement implemented after the v1 webapp is up and running, but I'm pushing it ahead. The more I think about it, the more I realize there was no sane world in which I manually collected, copied, pruned, and stored all of the testing documents. I will keep the option to read input from a pre-existing plaintext file as well, though.

I merged the verbosely named "Development_to_Overcome_Errors_Found_During_Testing" branch back into Master after cleaning up the first round of parsing errors. I'm going to close down that branch and make a new one dedicated to transitioning the input method.


Tuesday, October 17, 2017

Overview of Scaffold's Direction Going Forward

It feels a little neglectful for my initial post to be devoid of any introduction to the project, but I'm going to opt to write up a separate page for that in the coming days.

Next Tasks To Tackle

More Robust Pre-processing of Test Documents - Currently, Scaffold's parsing logic does not hold up well against formatting errors in its input documents. Ideally, all of the input to Scaffold would be free of non-ASCII characters, non-standard whitespace distribution, and unexpected punctuation usage. Realistically, these are all common occurrences, and every new article I have transferred to a plaintext document for testing has had to be sanitized. This is a pretty serious bottleneck in running test cases, and it also makes it more difficult to discern whether errors occurred due to Scaffold's parsing logic or unexpected input formatting.
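A first pass at automating that sanitization could look something like the sketch below: map common typographic characters to ASCII equivalents, strip whatever non-ASCII remains, and normalize whitespace. The `sanitize` name and the replacement table are illustrative placeholders, not the eventual pre-processing step itself.

```python
import re
import unicodedata

def sanitize(text):
    """Normalize a raw article toward the plain-ASCII, plain-whitespace form Scaffold expects."""
    # Map curly quotes, dashes, and non-breaking spaces to ASCII cousins first,
    # so they survive the non-ASCII strip below.
    replacements = {
        "\u2018": "'", "\u2019": "'",
        "\u201c": '"', "\u201d": '"',
        "\u2013": "-", "\u2014": "-",
        "\u00a0": " ",
    }
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    # Decompose accented characters, then drop anything still non-ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Collapse runs of spaces/tabs within paragraphs, preserving paragraph breaks.
    paragraphs = [re.sub(r"[ \t]+", " ", p).strip() for p in text.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)
```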

Building A Rich and Representative Corpus for Testing - Scaffold's target input is journalistic articles written using the AP Style writing guidelines. This means that first and foremost the testing corpus should contain articles from a wide variety of popular publications that adhere to the AP stylebook. The articles themselves also need to vary - in length, author background, tone, and subject matter. I'm going to try to be mindful about including an unbiased and fairly representative selection of articles, especially since I am the only one overseeing the selection.

I'm going to add some non-AP style adherent articles into the testing line-up as well. Sticking to a single set of syntax and grammar expectations has been really helpful in scoping my first iteration of Scaffold, but doesn't satisfy my ultimate vision for the project.

Scaffold is intended to aid users performing investigatory research. I don't want Scaffold to have the side effect of limiting the sources that users include in their research. There is a wealth of reputable, reliable publications and independent pieces of writing that do not use AP guidelines. Conversely, a piece of journalism that adheres to the AP stylebook is not necessarily a trustworthy, well-researched source. That being said, projects thrash and drown without scope.

For now, I am going to settle for including a selection of non-AP documents in my tests because, at the very least, I don't want Scaffold to completely crap itself if it is given such a document in the wild. I don't expect users to know off-hand whether or not an article they feed to Scaffold is AP-style compliant. Okay, actually, side note - I should build a bit of code that can take note of whether an article is AP-style compliant. Then users can get an "I'm going to try my best, but no guarantees" message for non-AP style documents.
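That compliance check could start out as a crude marker-counting heuristic along these lines. To be clear, this is a hypothetical sketch: the marker list, the scoring, and the threshold are all placeholders, and a real check would need many more AP conventions than the two sampled here (month abbreviations before dates, and spelling out "percent" rather than using the % sign, per the stylebook as of 2017).

```python
import re

# Hypothetical markers, not a real compliance checker.
AP_MONTH_ABBREV = re.compile(r"\b(Jan|Feb|Aug|Sept|Oct|Nov|Dec)\. \d")
PERCENT_SIGN = re.compile(r"\d%")

def looks_ap_compliant(text):
    """Crude guess at AP-style compliance based on a couple of known conventions."""
    score = 0
    if AP_MONTH_ABBREV.search(text):
        score += 1  # AP abbreviates certain months when used with a specific date
    if " percent" in text:
        score += 1  # AP (as of 2017) spells out "percent"
    if PERCENT_SIGN.search(text):
        score -= 1  # a bare % sign suggests non-AP copy
    return score >= 0
```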

To keep things simple legally, I'm not planning on publishing the content of the corpus I use for testing.

Automating the Testing Process and Documenting Test Results - The testing corpus should be able to run through Scaffold. From there, I want data regarding error occurrence across the entire corpus (minus the non-AP style documents, whose errors will be considered separately). I'll elaborate on this once I have executed the previous two tasks and this is closer to the front of my docket.

Once That's Taken Care Of...

Once I have built a more extensive testing corpus and automated the testing process, I will have data on what kinds of errors Scaffold is making, how frequently it's making them, and how detrimental they are to program usage. From there, I am going to establish an informed set of success criteria.

These success criteria are going to reflect my expectations for parsing and named-entity recognition performance, given the limitations of my primarily heuristic approach to these tasks.

My goal is to ship a first version of Scaffold built using NLTK and a big heap of rule-of-thumb logic. The current plan is to build a simple webapp that makes Scaffold's functionality accessible.

Once a first version of a Scaffold webapp is up and running, I'll revisit the backend logic and start gutting it in favor of a more ML-driven approach. Which brings me to...

In the Meantime...

While I work on completing the first version of Scaffold, I have also started diving into machine learning and ML-driven NLP. A large part of why I am interested in making Scaffold is that I want to learn about natural language processing, and effective NLP techniques rely on machine learning. There's a good chance I'll blog about it here once in a while.


Late January Updates

Nothing too big, this time, just wanted to pop over here for a check-in. My eagerness to move forward with my work on Scaffold has been mom...