I am working on a hobby project in which I have to crawl different web pages, do some analysis, and answer some queries. For example, a web page can contain data like:
One people injured in robbery.
Two people were injured in attempted robbery case last night.
Police is looking for the persons who injured three persons in attempted robbery.
I am interested in answering queries such as how many persons were injured in each of these incidents. How can I do this? Are there any libraries that can help me with this task?
Try out the Stanford CoreNLP demo. It is a natural language processing toolkit that includes a part-of-speech tagger and a named-entity recognizer. It generates both XML output and pretty-printed output, and it tags “one” in “one man injured in robbery” as a number. This could be really helpful to you.
Then you can use a DOM parser in Java to parse the XML file. You can pick out the “one” by checking the “NER” tag of each token in the file and seeing whether it marks the token as a number.
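As a rough sketch of the DOM parsing step, something like the following could work. It assumes the XML has `<token>` elements, each with a `<word>` child and an `<NER>` child whose value is `NUMBER` for numeric tokens; check this against the XML that the demo actually produces, since the layout may differ between versions. The class name `NumberTokenExtractor`, the method names, and the inline sample XML are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NumberTokenExtractor {

    // Returns the <word> text of every <token> whose <NER> child is "NUMBER".
    // The tag names are an assumption about the CoreNLP XML layout.
    public static List<String> extractNumbers(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> numbers = new ArrayList<>();
        NodeList tokens = doc.getElementsByTagName("token");
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            if ("NUMBER".equals(childText(token, "NER"))) {
                numbers.add(childText(token, "word"));
            }
        }
        return numbers;
    }

    // Text of the first descendant element with the given tag, or null if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample in the assumed format, not real CoreNLP output.
        String xml = "<root><document><sentences><sentence id=\"1\"><tokens>"
                + "<token id=\"1\"><word>One</word><POS>CD</POS><NER>NUMBER</NER></token>"
                + "<token id=\"2\"><word>man</word><POS>NN</POS><NER>O</NER></token>"
                + "</tokens></sentence></sentences></document></root>";
        System.out.println(extractNumbers(xml)); // prints [One]
    }
}
```

This only gives you the number words themselves. To answer "how many persons were injured" you would still need to convert a word like "One" to a numeric value and tie it to the right incident.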