I’m a college student looking for an NLP library to perform subject extraction and sentiment analysis in a Java-based web application for a summer hobby project.
To give you a little context on what I’m trying to do… I want to build a Java-based web application that will extract the subjects from Reddit submission headlines, as well as identify the OP’s sentiment in each headline (when possible).
Example Inputs:
- Reddit, we took the anti-SOPA petition from 943,702 signatures to 3,460,313. The anti-CISPA petition is at 691,768, a bill expansively worse than SOPA. Please bump it, then let us discuss further measures or our past efforts are in vain. We did it before, I’m afraid we are called on to do it again.
- My friend calls him “Mr Ridiculously Photogenic Guy”
- Insanity: CISPA Just Got Way Worse, And Then Passed On Rushed Vote
I’m currently trying out AlchemyAPI, but it sounds like better NLP libraries exist out there. Preferably, I wouldn’t be restricted to a limited number of API requests in a given time period (AlchemyAPI has a quota). I’ve heard the names GATE, LingPipe, and OpenNLP; however, I’m unsure whether they fit my needs.
I’m looking for framework/library/API recommendations, or even better, comparisons from experienced users. My experience with NLP is extremely limited, which is why I’m asking for help here (PS: if anyone has any resources for learning more, outside of http://www.nlp-class.org, please let me know!) 🙂
First, I’d highly recommend using Python, as its NLP libraries are a bit more user-friendly than Java’s, and it’d be a lot less code to maintain for a one-man project.
I can’t think of anything off the top of my head that does either classification out of the box, so my recommendation would be to train two classifiers, one for subject and one for sentiment. You’ll have to label data and define features, but I think that wouldn’t be too hard, especially for sentiment, where you build up a dictionary of ‘emotion’ words. Labeling data is a pain in the ass, but that and good features are how you get good classification.
Subject Classifier:
Use NLTK with a Naive Bayes classifier, and define the features as each word (lowercased) plus the word bigrams and trigrams.
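As a rough sketch of those features (not a definitive implementation), something like the following would turn a headline into a feature dict. The function names, the feature-key format, and the plain whitespace tokenization are all my own assumptions for illustration; a proper tokenizer would handle punctuation better.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def headline_features(headline):
    """Map a headline to a dict of lowercased word, bigram, and trigram features."""
    # Assumption: naive whitespace tokenization, punctuation left attached.
    tokens = headline.lower().split()
    features = {}
    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            # Hypothetical key format, e.g. "2gram(just got)".
            features["%dgram(%s)" % (n, " ".join(gram))] = True
    return features
```

A dict like this, paired with a label as `(features, label)`, is the shape of training example that NLTK's `NaiveBayesClassifier.train` takes.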
Sentiment Classifier:
Same features as the subject classifier, plus a feature that says word w is in the emotion dictionary with connection c. So, for example, the word ‘bad’ would be listed with the connection ‘bad sentiment’.
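To make the emotion-dictionary feature concrete, here is a minimal sketch. The dictionary contents, the `emotion_features` name, and the feature-key format are made up for illustration; in practice you would build the dictionary yourself from your own labeled data.

```python
import string

# Hypothetical, hand-built emotion dictionary: word -> sentiment connection.
EMOTION_WORDS = {
    "bad": "negative",
    "worse": "negative",
    "insanity": "negative",
    "afraid": "negative",
    "good": "positive",
    "photogenic": "positive",
}


def emotion_features(headline, lexicon=EMOTION_WORDS):
    """Return one feature per headline word found in the emotion dictionary."""
    features = {}
    for token in headline.lower().split():
        # Drop leading/trailing ASCII punctuation so "insanity:" matches "insanity".
        word = token.strip(string.punctuation)
        connection = lexicon.get(word)
        if connection is not None:
            features["emotion(%s)=%s" % (word, connection)] = True
    return features
```

These features can be merged into the word and n-gram features with `dict.update` before training the sentiment classifier.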
Once you’ve amassed sufficient training/testing data, you train your classifiers and optimize features, if necessary, and then you can run the classifiers against whatever other data you want.
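The train-then-test loop could look roughly like this. It assumes NLTK is installed and that your labeled data is a list of `(features, label)` pairs; `split_data` and `train_and_evaluate` are hypothetical helper names, and the 75/25 split is an arbitrary choice.

```python
import random


def split_data(labeled, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into (train, test) lists."""
    shuffled = list(labeled)
    # Fixed seed so the split is reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_and_evaluate(labeled_featuresets):
    """Train an NLTK Naive Bayes classifier and return it with held-out accuracy."""
    # Imported here so split_data stays usable without NLTK installed.
    import nltk

    train_set, test_set = split_data(labeled_featuresets)
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.accuracy(classifier, test_set)
    return classifier, accuracy
```

If the accuracy is poor, `classifier.show_most_informative_features()` is a reasonable place to start when deciding which features to add or drop.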
General Purpose Libraries (Java):
Libraries (Python):