The next challenge is getting Scaffold to identify and extract the body text from a webpage it has been given. NLTK used to have a clean_html function that would do this, but it appears to have been dropped from newer releases.
I am going to do a bit of research to see if there are any up-to-date and reliable Python packages that will take care of this. If I can't find a free tool that has the functionality I need, I'm probably going to dive into the task myself and see how it goes.
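To make the task concrete, here is a minimal sketch of the most naive version, using only html.parser from Python's standard library. It is not Scaffold's code; the class name, function name, and list of skipped tags are all made up for illustration. It only strips markup and drops text that is never displayed, so navigation menus, footers, and the like would still come through, which is exactly the part that makes real body-text extraction hard.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping tags whose contents are never displayed."""

    # Hypothetical skip list; real pages would likely need more entries.
    SKIPPED_TAGS = {"script", "style", "head", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside any skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside every skipped tag.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        # Joining with spaces is crude: it can leave a stray space
        # before punctuation that follows an inline tag.
        return " ".join(self._chunks)


def extract_visible_text(html):
    """Return the displayed text of an HTML string as one line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

Feeding it a small page with a title, a stylesheet, a heading, an inline script, and a paragraph should return just the heading and paragraph text. Anything smarter, such as telling the article apart from the sidebar, is what I would want an existing package to handle.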
In all honesty, I am really tempted to try to build my own tool from scratch to isolate the body text from a webpage; it sounds like a fun mini-project. I would, however, also like to remind myself that reinventing the wheel every time I run into an interesting problem is a great way to never release a finished version of this project.
I basically ran through the same debate when determining how to handle the named-entity recognition and part-of-speech tagging tasks for Scaffold. I had designed a system of proper noun classification functions that would take phrase context into consideration and assign a weighted probability to each possible classification of a token. I began coding and got through a decent portion of this numerical, probability-driven classification system. When it came to determining how I was going to train this system to provide reasonable probability values for each classification, it became clear that I was trying to solve a much more complex problem than I had realized.
A fork had appeared in my development path. On one path, I would continue to try building my original classification system and hope that it would reliably identify people/locations/etc. after enough training and hacky application of NLP principles. On the other path, I would put my original design aside and let NLTK take the NER weight off of my shoulders, at least for the version one release.
If it was not clear already from the bent of my storytelling, I chose the latter path. While it was hard to abandon an idea I was so excited about owning, I'm glad I did it for the sake of Scaffold surviving its natal state.
TL;DR I should try to find a tool to solve this webpage parsing problem before falling too far down the rabbit hole trying to build my own. Once the first Scaffold release is up and running, I can swap out all of the tools that do a pretty good job of what I need for more tailor-made, ambitious stuff.