Bit of a random one: I want to have a play with some NLP stuff, and I would like to:
Get all the text that will be displayed to the user in a browser from HTML.
My ideal output would not have any tags in it and would only contain full stops (and any other punctuation used) and newline characters, though I can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in the output).
If there were a way of inserting a newline or full stop in situations where the content is unlikely to continue, that would be an added bonus. For example:
items in a ul or option tag could be separated by full stops (or, to be honest, just ignored).
I am working in Java, but would be interested in seeing any code that does this.
I can (and will if required) come up with something to do this, just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-).
An example of what I might write if I do end up doing this: use a SAX parser to find content in p tags, strip out any span, strong, etc. tags, and add a full stop if I hit a div or another p without having seen a full stop.
Any pointers or suggestions very welcome.
HTML parsers seem to be a reasonable starting point for this.
There are a number of them; for example, HtmlCleaner and NekoHTML both seem to work fine.
They are good as they fix the tags to allow you to more consistently process them, even if you are just removing them.
But as it turns out, you probably also want to get rid of script tags, metadata, etc. In that case you are better off working with well-formed XML, which these parsers produce for you from “wild” HTML.
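As a rough illustration of that second step, here is a minimal sketch using the JDK's built-in SAX parser. It assumes the HTML has already been turned into well-formed XML by one of the cleaners above, with no DOCTYPE and no HTML-only entities such as &nbsp; (the plain XML parser would choke on those). The class name VisibleTextExtractor, the extract method, and the lists of hidden and block-level tags are made up for this example and are certainly not complete.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: pulls user-visible text out of already-cleaned, well-formed XHTML.
final class VisibleTextExtractor {
    // Assumption: content of these elements is never shown to the user.
    private static final Set<String> HIDDEN = Set.of("script", "style", "head");
    // Assumption: text is unlikely to continue as one sentence across these elements.
    private static final Set<String> BLOCK = Set.of(
            "p", "div", "li", "option", "td", "th", "br",
            "h1", "h2", "h3", "h4", "h5", "h6");

    static String extract(String xhtml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int hiddenDepth = 0;                        // > 0 while inside a hidden element
            private final StringBuilder current = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                String name = qName.toLowerCase(Locale.ROOT);
                if (HIDDEN.contains(name)) hiddenDepth++;
                else if (BLOCK.contains(name)) flush();
                // Inline tags (span, strong, ...) are simply ignored, which strips them.
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String name = qName.toLowerCase(Locale.ROOT);
                if (HIDDEN.contains(name)) hiddenDepth--;
                else if (BLOCK.contains(name)) flush();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (hiddenDepth == 0) current.append(ch, start, length);
            }

            @Override
            public void endDocument() {
                flush();
            }

            // Emit the buffered text as one line, adding a full stop if it has no closing punctuation.
            private void flush() {
                String text = current.toString().replaceAll("\\s+", " ").trim();
                current.setLength(0);
                if (text.isEmpty()) return;
                out.append(text);
                char last = text.charAt(text.length() - 1);
                if (".!?:;".indexOf(last) < 0) out.append('.');
                out.append('\n');
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return out.toString();
    }
}
```

Calling VisibleTextExtractor.extract(cleanedXhtml) returns one line per block of text. Treating li and option as blocks gives the "full stop after each list item" behaviour; to ignore lists entirely, "ul" and "select" could be moved into the hidden set instead.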
There are many SO questions relating to this (like this one); you should search for “HTML parsing” though 😉