Before I can build a system that automatically classifies text, I need to manually classify a whole bunch of samples as a training/evaluation set. Is there some existing tool that will let me manually tag thousands of items without too much pain? And if not, what’s the quickest way to whip something together?
As an example, imagine you have a bunch of Twitter messages. You’d like to put them in particular buckets: happy, sad, funny, angry, and spam. Some things go in multiple buckets. You could just dump everything into a file and insert some tags with vi, but that’s error-prone and kinda slow. More importantly, having a nice interface means maybe you can talk your colleagues into doing a bunch of the work. Web, GUI, or console doesn’t matter much, as long as it’s quick and easy. Is there anything like that?
I’m hoping yes, although I can’t find anything with Google. If I have to build something, is there a good place to start? From rummaging, my first impression is that Rails + jQuery + acts_as_taggable_on + jQuery Tokenizing Autocomplete seems ok, but I’m open to other things.
In my case, I ended up building something with HighLine, a Ruby library for command-line interfaces. It’s not as fancy as a web-based interface, but it was simple to build and, thanks to its single-character mode, very fast to use.
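For anyone who wants a starting point, here is a minimal sketch of the single-keypress idea. It is not the tool I built: it swaps HighLine for `IO#getch` from Ruby's standard `io/console` library so that it runs without any gems. The key mapping, the function names `tag_item` and `tag_all`, and the JSON-lines output format are all assumptions made up for illustration.

```ruby
require "io/console"
require "json"

# Hypothetical key-to-label mapping, using the buckets from the question.
KEYS = {
  "h" => "happy",
  "s" => "sad",
  "f" => "funny",
  "a" => "angry",
  "x" => "spam"
}.freeze

# Tag one item. `read_key` is any callable that returns one character per
# call, so the loop can be driven by a real keyboard or by canned input.
# Pressing a label key toggles that label; Enter moves on to the next item.
def tag_item(text, read_key, out: $stdout)
  legend = KEYS.map { |key, label| "[#{key}]#{label}" }.join(" ")
  out.puts text
  out.puts "#{legend} [Enter]next"
  labels = []
  loop do
    key = read_key.call
    break if key.nil? || key == "\r" || key == "\n"
    # getch reads in raw mode, so Ctrl-C arrives as a character.
    abort "aborted" if key == "\u0003"
    label = KEYS[key]
    next unless label # ignore unmapped keys
    labels.include?(label) ? labels.delete(label) : labels.push(label)
    out.puts "  -> #{labels.join(', ')}"
  end
  labels
end

# Tag every item; returns an array of { "text" => ..., "labels" => [...] }.
def tag_all(items, read_key, out: $stdout)
  items.map do |text|
    { "text" => text, "labels" => tag_item(text, read_key, out: out) }
  end
end

# Usage (assumed): ruby tagger.rb messages.txt > tagged.jsonl
if __FILE__ == $PROGRAM_NAME && ARGV.any?
  items = File.readlines(ARGV.first, chomp: true)
  # Prompts go to stderr so stdout carries only the tagged records.
  results = tag_all(items, -> { $stdin.getch }, out: $stderr)
  results.each { |record| puts JSON.generate(record) }
end
```

Because one keypress toggles one label, an item that belongs in several buckets takes only a few keystrokes, and passing the key reader in as a callable makes it easy to test the loop or swap in another input method later.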