I need to recognize the main content of a page, similar to what http://www.alchemyapi.com/api/text/ does (I need the result as HTML, so I can't use this API).
What logic can I use to accomplish this? (The programming language does not matter.)
Here is what I did (with good results), although it still needs a lot more fixes:
1. Find the element with the most text that is not interrupted by breaking tags, ignoring inline tags (span, b, etc.)
2. Go up one level and count the breaking tags (br, p, div, etc.)
3. Go up another level and count the breaking tags again
4. Compare the tag count from step 2 with the count from step 3
5. If the difference is large, stop here; if not, go back to step 3
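To make the steps above concrete, here is a rough sketch in Python using only the standard library. It is one possible reading of the heuristic, not a tuned implementation: the `INLINE` tag set, the `jump` threshold (how much bigger the upper count must be before we stop), and all function names are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Assumed tag sets; extend them for real pages.
INLINE = {"span", "b", "i", "em", "strong", "a", "u", "small", "code"}
VOID = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children = []  # holds Node objects and plain text strings
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; not a full HTML5 parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in ("script", "style") or not data.strip():
            return
        self.current.children.append(data)


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def own_text_len(node):
    """Text length reachable from node without crossing a breaking tag."""
    total = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child.strip())
        elif child.tag in INLINE:
            total += own_text_len(child)
    return total


def count_breaking(node):
    """Number of non-inline descendants of node."""
    return sum(1 for n in walk(node) if n is not node and n.tag not in INLINE)


def find_content_root(html, jump=3.0):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    root = builder.root
    blocks = [n for n in walk(root) if n is not root and n.tag not in INLINE]
    if not blocks:
        return None
    start = max(blocks, key=own_text_len)                      # step 1
    container = start.parent if start.parent is not root else start  # step 2
    # Steps 3-5: climb until the breaking-tag count jumps sharply.
    while container.parent is not None and container.parent.parent is not None:
        lower = count_breaking(container)
        upper = count_breaking(container.parent)
        if upper > jump * max(lower, 1):
            break
        container = container.parent
    return container


def to_html(node):
    """Serialize a node back to HTML so the tags are preserved."""
    if isinstance(node, str):
        return escape(node, quote=False)
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{escape(v)}"'
        for k, v in node.attrs.items()
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    inner = "".join(to_html(c) for c in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"
```

Calling `to_html(find_content_root(page_source))` returns the chosen container as an HTML string (here `page_source` is just a placeholder for the downloaded page). Pages with deeply nested wrappers or comment sections will probably need a different threshold or extra signals such as link density.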
Look into the Boilerpipe library. It is a comprehensive solution for this problem.
With Boilerpipe you can specify HTML as the output format, so you get the main content (the article) while still preserving its HTML tags.