Tuesday, October 17, 2017

Overview of Scaffold's Direction Going Forward

It feels a little neglectful for my initial post to be devoid of any introduction to the project, but I'm opting to write up a separate page for that in the coming days.

Next Tasks To Tackle

More Robust Pre-processing of Test Documents - Currently, Scaffold's parsing logic does not hold up well against formatting errors in its input documents. Ideally, all of the input to Scaffold would be free of non-ASCII characters, irregular whitespace, and unexpected punctuation usage. Realistically, these are all common occurrences, and every new article I have transferred to a plaintext document for testing has had to be sanitized by hand. This is a pretty serious bottleneck in running test cases, and it also makes it harder to discern whether an error came from Scaffold's parsing logic or from unexpected input formatting.
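To make that concrete, here is a minimal sketch of what an automated sanitizing pass could look like. It is not Scaffold's actual code: the `sanitize` function and the `PUNCTUATION_MAP` table are hypothetical names I'm using for illustration, and the list of characters handled is an assumption about what tends to show up in copied articles.

```python
import re
import unicodedata

# Hypothetical replacement table: common "smart" punctuation mapped to ASCII.
PUNCTUATION_MAP = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis character
    "\u00a0": " ",                  # non-breaking space
}


def sanitize(text):
    """Return an ASCII-only copy of text with regular whitespace."""
    # Swap known punctuation for ASCII equivalents before anything is dropped.
    for src, dst in PUNCTUATION_MAP.items():
        text = text.replace(src, dst)
    # Decompose accented letters, then discard whatever is still non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.splitlines())
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

One caveat with this approach: dropping every remaining non-ASCII character is lossy. Accented names survive as their base letters, but text in non-Latin scripts would disappear entirely, so a real version would probably want to log what it removes.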

Building A Rich and Representative Corpus for Testing - Scaffold's target input is journalistic articles written using the AP Style writing guidelines. This means that first and foremost the testing corpus should contain articles from a wide variety of popular publications that adhere to the AP stylebook. The articles themselves also need to vary - in length, author background, tone, and subject matter. I'm going to try to be mindful about including an unbiased and fairly representative selection of articles, especially since I am the only one overseeing the selection.

I'm going to add some non-AP style adherent articles into the testing line-up as well. Sticking to a single set of syntax and grammar expectations has been really helpful in scoping my first iteration of Scaffold, but doesn't satisfy my ultimate vision for the project.

Scaffold is intended to aid users performing investigatory research. I don't want Scaffold to have the side-effect of limiting the sources that users include in their research. There is a wealth of reputable, reliable publications and independent pieces of writing that do not use AP guidelines. Conversely, a piece of journalism that adheres to the AP stylebook is not necessarily a trustworthy, well-researched source. That being said, projects thrash and drown without scope.

For now, I am going to settle for including a selection of non-AP documents in my tests because at the very least I don't want Scaffold to completely crap itself if it is given such a document in the wild. I don't expect users to know off-hand whether an article that they feed to Scaffold is AP-style compliant. Okay, actually, side note - I should build a bit of code that can detect whether an article is AP-style compliant. Then users can get an "I'm going to try my best, but no guarantees" message for non-AP style documents.

To keep things simple legally, I'm not planning on publishing the content of the corpus I use for testing.

Automating the Testing Process and Documenting Test Results - I should be able to run the whole testing corpus through Scaffold in one go. From there, I want data on error occurrence across the entire corpus (minus non-AP style documents, whose errors will be considered separately). I'll elaborate on this once I have executed the previous two tasks and this is closer to the front of my docket.
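As a rough illustration of the shape this could take, here is a small sketch that walks a corpus folder and tallies error categories, keeping non-AP documents in their own bucket. Everything specific in it is an assumption: `run_corpus` is a made-up name, `check_document` is a hypothetical stand-in for whatever Scaffold ends up exposing, and the `non_ap` subfolder is just one possible way to lay out the corpus.

```python
from collections import Counter
from pathlib import Path


def run_corpus(corpus_dir, check_document):
    """Run every .txt file under corpus_dir through check_document.

    check_document is a hypothetical callable standing in for Scaffold: it
    takes a document's text and returns a list of error-category strings.
    Files inside a "non_ap" subfolder (an assumed layout) are tallied
    separately from the AP-style documents.
    """
    tallies = {"ap": Counter(), "non_ap": Counter()}
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        # Only look at folder names below the corpus root when grouping.
        parts = path.relative_to(corpus_dir).parts
        group = "non_ap" if "non_ap" in parts else "ap"
        text = path.read_text(encoding="utf-8")
        tallies[group].update(check_document(text))
    return tallies
```

Calling `run_corpus("corpus", some_checker)["ap"].most_common()` would then list the most frequent error categories for the AP-style documents, which is roughly the kind of data I'm after.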

Once That's Taken Care Of...

The ultimate goal of building a more extensive testing corpus and automating the testing process is to establish an informed set of success criteria. Once I have data on what kinds of errors Scaffold is making, how frequently it's making them, and how detrimental they are to program usage, I'll be in a position to set those criteria.

These success criteria are going to reflect my expectations for parsing and named-entity recognition performance given the limitations of my primarily heuristic approach to these tasks.

My goal is to ship a first version of Scaffold that is built using nltk and a big heap of rule-of-thumb logic. The current plan is to build a simple webapp that makes Scaffold's functionality accessible.

Once a first version of a Scaffold webapp is up and running, I'll revisit the backend logic and start gutting it in favor of a more ML-driven approach. Which brings me to...

In the Meantime...

While I work on completing the first version of Scaffold, I have also started to dive into machine learning and ML-driven NLP. A large part of why I am interested in making Scaffold is that I want to learn about natural language processing, and many of the most effective NLP techniques rely on machine learning. There's a good chance I'll blog about it here once in a while.

