Scaffold

Tuesday, January 23, 2018

Late January Updates

Nothing too big, this time, just wanted to pop over here for a check-in. My eagerness to move forward with my work on Scaffold has been momentarily eclipsed by my need to complete the remainder of my programming assignments for the ML course I'm taking before my MATLAB license runs out.

In the meantime, however, I've been setting up to move forward with integrating a MySQL database into Scaffold. The app is up and running locally on my Ubuntu VM after a rousing round of package installs. My immediate next move is researching recommended security practices for AWS RDS DB instances.

The largest obstacle right now is trying to develop in an Ubuntu VM that is running on a laptop with a limited amount of RAM. In terms of actual project impact, it just means more work time eaten up by freezing, crashing, and having to check and clean up what applications are running. I'm sure more diagnostic work could turn up some modifications to VirtualBox or Windows 10 that will free up RAM. I'm less sure that it will be enough RAM to properly address my issue, but it's worth an hour of Googling.

I want to make a joke about using this development obstacle as a petty excuse to buy a new laptop, but the fact that this is a borrowed machine* and my own ancient laptop is broken and legitimately in need of replacement really blows the setup.

Anyway, as I mentioned before, progress may be a little mellow right now due to my coursework and a couple of other projects that have come up, but hopefully I should be able to refocus more attention on Scaffold soon. Until then, thanks for bearing with me and may your RAM be ever-bountiful.

*A borrowed machine for which I am very grateful

Monday, January 8, 2018

Testing Out A Deployment Pipeline

I've been establishing a deployment pipeline. I found a straightforward tutorial for deploying a Python/Flask application through free-tier AWS tools. The only hitch was that the tutorial was written for use with a Linux system and I intended to deploy from Windows 10. I read through the tutorial and ultimately decided that I had enough experience working in both an Ubuntu terminal and the Windows CLI to translate as I went.

This is an overview of the process:

-create a Python virtual environment to deploy from using virtualenv
-configure the virtual environment with the project's required packages
-launch an RDS MySQL database using that sweet, sweet AWS free tier, then adjust security/access
-hook up backend of the python app to the newly created RDS MySQL database instance

optional: Play around with application locally! The sample app in the tutorial allows users to add entries to a database and then retrieve a specified number of them chronologically.

-install AWS Elastic Beanstalk CLI
-in the AWS Console, create a user with the required permissions to interface with the EBCLI
-launch the EBCLI and begin configuring the application's deployment settings
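For the "hook up backend" step in the overview above, here is a minimal sketch of how a Flask app's database settings might be assembled. This is not the tutorial's code. It assumes the connection details live in environment variables named `RDS_HOSTNAME`, `RDS_PORT`, `RDS_DB_NAME`, `RDS_USERNAME`, and `RDS_PASSWORD` (the names Elastic Beanstalk uses when a database is attached to an environment), and that the app talks to MySQL through SQLAlchemy with the PyMySQL driver. The function name `build_mysql_uri` is hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_mysql_uri(env=os.environ):
    """Assemble a SQLAlchemy-style MySQL URI from environment variables.

    Assumption: the RDS_* variable names and the mysql+pymysql scheme are
    illustrative choices, not something the tutorial is known to use.
    """
    # Escape the credentials so characters like "@" or ":" don't break the URI.
    user = quote_plus(env["RDS_USERNAME"])
    password = quote_plus(env["RDS_PASSWORD"])
    host = env["RDS_HOSTNAME"]
    # 3306 is MySQL's default port; fall back to it if none is set.
    port = env.get("RDS_PORT", "3306")
    db_name = env["RDS_DB_NAME"]
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{db_name}"
```

Keeping credentials in the environment rather than in the source also means they never end up in the git history that gets deployed.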

So, EBCLI deploys updates to the application code via git, which is really neat and handy. The EBCLI could not find my git path. This led to a couple hours of what I can only describe as calculated molestation of the Windows system path environment variables. After all of my cajoling could not shake EBCLI's assertions that I didn't have git installed, I decided that I was better off preserving the integrity of my system and setting up an Ubuntu VM to deploy from.
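For anyone stuck at a similar point, a small diagnostic sketch: it checks whether a Python process can see git on the PATH it inherited. It only uses the standard library's `shutil.which`. How the EBCLI itself looks for git is not something this post establishes, so treat this as a sanity check rather than a reproduction of the EBCLI's lookup. The helper names `find_executable` and `path_entries` are made up for illustration.

```python
import os
import shutil


def find_executable(name, path=None):
    """Return the full path to an executable, or None if it is not on PATH."""
    return shutil.which(name, path=path)


def path_entries(path=None):
    """Split a PATH-style string into its non-empty directory entries."""
    raw = os.environ.get("PATH", "") if path is None else path
    return [entry for entry in raw.split(os.pathsep) if entry]


if __name__ == "__main__":
    git_location = find_executable("git")
    if git_location:
        print("git found at:", git_location)
    else:
        # Listing the searched directories makes a missing or mangled
        # PATH entry easier to spot.
        print("git not found; directories searched:")
        for entry in path_entries():
            print("  ", entry)
```

Running this from the same terminal (and the same virtual environment) that launches the EBCLI shows the PATH that process actually inherits, which can differ from what the Windows settings dialog displays until a new terminal is opened.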

Post-holidays, this is exactly what I did, and it ended up working like a charm after I ate the requisite time to re-create the environment config I needed. From here I just had to:

-resolve a corrupted commit from the source code's git log (step not included in the tutorial)
-set up a server location/python version/etc. in EBCLI
-decide on a DNS CN with some flair in EBCLI

And that was that! For at least a couple of days, http://flaskparty2k18.us-west-2.elasticbeanstalk.com/ will be up online for lovers of simple and insecure web-apps to enjoy! (More stringent security protocols will be employed during Scaffold's deployment process.)

I'm looking forward to trying to replicate this process in a couple of days when I am free to work on Scaffold again. I should mention that this will involve adding the code to link the database I set up in AWS to my Flask app.

This blog post is dedicated to my old Lenovo T430s who during this process fell ill, was disassembled, reassembled, and retired to a closet to be accessed only remotely.

EDIT 1/8/2018 - Originally I titled this post "Ubuntu Be Kidding Me!" which, on second thought, I just could not live with.

Scaffold: Waste Less Time Reading Fluff Journalism! Spend More Time Reading Long Blog Posts and Their Long Titles!

I wanted to take a moment to expand on why I am creating Scaffold and trying to put it out into the world, aside from the shoo-in fact that I just plain love programming, learning new skills, and building software.

My core motivation to build Scaffold and not something else is that I want to facilitate the digestion of journalistic articles that are information-rich. The internet is a glorious slip-and-slide for gliding through swaths of research sources. Unfortunately, the majority of the sprinklers are lubricating the ride with terrible journalism. More specifically, I'm hung up on the plethora of articles (even from overall respectable institutions) that are composed of:

-a single [quote/fact/event/conclusion from "a study"] nugget. Possibly two nuggets if lucky.
-rhetorical questions
-hype-forward implied correlations to unrelated but more intriguing topics
-a heaping scoop of whatever the most recent overused adjectives are

For individuals performing research that requires a variety of sources, I want it to be easier to ignore these articles before they waste their time reading them. And frankly, some information-rich articles simply suffer from a non-representative title that makes the reader believe that their topic will be covered more significantly than it is. Some sites will beta-test several article titles before choosing one permanently. Click-through counts are worth more than actual article relevance.

My goal is to minimize the number of "easily-digestible" sources that users conducting topic research are forced to rely on due to time being wasted. If someone has thirty minutes to find a source and spends it reading a multitude of technically-relevant and easy-to-find but crappy sources, then at the end of thirty minutes it becomes a matter of settling on the least crappy source. God forbid they also have to spend a portion of that time parsing through the article to present its contents.

I'm not here on the grounds of journalistic integrity. I'm building Scaffold because my time while performing topic research has been wasted by unhelpful text articles, and it bothers me to have my time wasted when the world is full of perfectly good CPUs that can be instructed to solve problems for me. I want Scaffold to specifically be an easy-to-use webapp because the nineteen years of my life before I learned how to program were when the majority of this wasted time occurred. Which is to say that non-technical persons deserve the benefit of technically-driven tools because their time is also valuable.

Frankly, if I can expand the abilities of the app and save the time of the journalists producing Scaffold's target input content, I hope it means that it's easier for them to do the research necessary to produce quality articles. Thus allowing me, a person not experienced in conducting journalism, to have the benefits of well-researched journalism. Which is kind of an idealist synergy-obsessive fever dream for the moment. But I stand by it.

Bearing that in mind, Scaffold doesn't verify facts or provide any kind of clarity as to whether the information found in the articles has been considered in a thoughtful way or given appropriate context. Scaffold doesn't pull out purely qualitative specifics even if they are objectively true.

Whenever I talk about what I am creating Scaffold for, I feel negligent if I don't emphasize the fact that it would be dangerous and inappropriate to confuse the presence of quotes/statistics/dates and times/specific citations to places and people etc. with a trustworthy source. Articles can cite studies that are inaccurate or outright shams. Partial quotes can be taken out of context. Dates and times need meaningful and verifiable events attached to them in order to have worth. Mentioning the names of every board member that showed up to a key meeting doesn't guarantee that the upshot of that meeting and the role of each individual will be accounted for. What I will say is that rarely do informative articles lack all of the features that Scaffold aims to isolate.

I Hope This Blog Post Isn't Used As Evidence Against Me In Court

Given my plan to set up a database for storing raw input to Scaffold, something on my mind has been what legal red tape I may eventually run into.

To be perfectly honest, I am not sure what the legal implications are of storing published content that I do not own on a server whose contents I manage. On one hand, this is a not-for-profit open source application. On the other hand, this could definitely present issues if I were interested in the eventual implementation of accounts that could store and access previously Scaffold-ed content. Which, for the record, I am interested in since I believe it would be exceptionally helpful for users conducting research and information compilation across multiple sessions.

As a bonus thought exercise: if I were to use a corpus containing copyrighted documents to train an ML model, what are the resulting limitations on my use of the model?

Down the line, I'll certainly revisit investigating the legality of article storage and access for users before embarking on the development of any features that would suck to have to immediately roll back due to litigation. To be clear, I also seriously care about not screwing over the content providers, even if unintentionally.

For now, I think I am safe being primarily concerned with first getting to the point where Scaffold is useful and well-liked enough to be worth suing over.

UI and Feedback Feature Updates

Significant improvements have been made to the UI. These changes include table formatting, content listing choices, addition of a page for a tutorial (the content of which will be written closer to the release), and making style changes to the font and color of the page.

I decided to roll back the UI elements intended to gather feedback for individual entity results and classified statements. For the individual entity results (on the UI these are described as "people", "locations", and "other named subjects"), I had added two columns to the table with drop-down menus. One indicated whether an entry had been correctly chunked (e.g. a person entry contained both the first and last name and excluded any other tokens/words). The other was to provide feedback on the classification of the entry (i.e. is an entry found in the person table correctly categorized, or should it have been tagged as a location or general named entity, or not categorized at all).

For the entire-phrase results, I used a simple radio button to indicate whether a phrase had been correctly identified as containing a quote/stat/date/time.

This decision to remove these feedback features was made for multiple reasons. My intentions for the data I would have been collecting weren't driven by a strong enough sense of direction. I began implementing the UI elements thinking that it would be helpful to capture feedback from the current Scaffold algorithm implementation in order to improve future versions. I began, however, to consider what specific role this data would play in significantly improving the accuracy of Scaffold's results going forward.

The research I began established that the best chance I had of improving the accuracy of my named entity chunking results was training my own chunker as opposed to using the pre-trained chunker that NLTK provides. It's an option that I didn't have the experience for going into the project, but would feel confident about taking on now, having acquired a significantly better understanding of NLP and ML concepts than when I began.

My conclusion from all of this was that I wasn't going to improve any of the end-product use cases by implementing the feedback-gathering features. On top of that, the feedback I was planning on gathering would not have had the volume or the qualities needed to contribute to the algorithmic improvements it was intended to support. The hit to the release timeline couldn't be justified, so feedback elements and all plans to hook them up to a database were shelved in good conscience.

My concession is that I'm planning to deploy with an AWS RDS MySQL database tied to the application in order to at least catch the raw text articles being input to Scaffold. This should open the doors to building a training set and understanding how Scaffold is being used.
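As a rough illustration of what "catching the raw text" could involve, here is a minimal sketch that stores each submitted article with a timestamp. It is not Scaffold's actual schema or code. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for the RDS MySQL instance, and the table name `raw_articles`, its columns, and the function names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn):
    """Create the (hypothetical) table that holds raw article submissions."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS raw_articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            submitted_at TEXT NOT NULL,
            body TEXT NOT NULL
        )
        """
    )
    conn.commit()


def store_article(conn, body):
    """Insert one article's raw text and return the new row's id."""
    submitted_at = datetime.now(timezone.utc).isoformat()
    # Parameterized query: user-supplied text is never spliced into the SQL.
    cursor = conn.execute(
        "INSERT INTO raw_articles (submitted_at, body) VALUES (?, ?)",
        (submitted_at, body),
    )
    conn.commit()
    return cursor.lastrowid


def count_articles(conn):
    """Return how many raw articles have been stored so far."""
    return conn.execute("SELECT COUNT(*) FROM raw_articles").fetchone()[0]
```

A quick local try-out would be `conn = sqlite3.connect(":memory:")`, then `init_db(conn)` and `store_article(conn, "some article text")`. Against MySQL the SQL dialect and the placeholder style would differ depending on the driver chosen.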

What's Up With All Of The Blog Posts Today?

I hope the first week of 2018 has been treating everyone alright!

I've been taking down thoughts and notes for a couple of weeks now on a few different Scaffold-related subjects and finally managed to congeal them into discrete topics and manageable trains of thought!

Enjoy!

Monday, December 4, 2017

Happy December!

It's been a hot minute since I posted any status updates on Scaffold so I wanted to check in. Long story short, I traveled out to the East Coast for a chunk of November. Once I got back, I switched my focus to catching up on machine learning coursework and cramming in job-hunting todos before most of the tech world goes on vacation for the latter half of December.

Moving forward, I'm trying to cook up a schedule with the end goal of releasing the first version of Scaffold before I check out for holiday traveling. This would establish my release date as December 22nd. It might be a little tight with everything else going on, but the primary functionality and bare-bones UI components have been implemented. If I don't put a deadline on this thing, I could reach to improve its features and generate new work items until the sun burns out.