I am building a news reader with an option for users to share an article from a blog, website, etc. by entering a link to the page. For now I am using two methods to determine the content of the page:
- I try to extract the RSS feed link from the page the user entered and then match the entered URL against the feed to find the right item.
- If the site doesn’t contain a feed, the feed is malformed, or the entered address differs from the item link in the RSS feed (which happens in about 50% of cases, if not more), I try to find og meta tags. That works great, but only bigger sites have them; smaller sites and blogs often use the same meta description for the whole website.
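To make the two methods above concrete, here is a rough sketch of the metadata step. It is written in Python with only the standard library for brevity, not with HtmlAgilityPack, and every name in it (`HeadMetadataParser`, `normalize_url`, `extract_head_metadata`) is made up for illustration. The URL normalization rules (ignore scheme, `www.`, trailing slash, fragment and `utm_*` parameters) are an assumption about why entered addresses often fail to match feed item links, not a verified cause.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class HeadMetadataParser(HTMLParser):
    """Collects feed autodiscovery links and Open Graph tags from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feed_urls = []
        self.og = {}

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; treat them as empty.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "link":
            rels = attrs.get("rel", "").lower().split()
            if "alternate" in rels and attrs.get("type", "").lower() in FEED_TYPES:
                href = attrs.get("href")
                if href:
                    # Feed hrefs are often relative, so resolve them.
                    self.feed_urls.append(urljoin(self.base_url, href))
        elif tag == "meta":
            prop = attrs.get("property", "").lower()
            if prop.startswith("og:") and attrs.get("content"):
                # Keep the first value seen for each og: property.
                self.og.setdefault(prop, attrs["content"])


def normalize_url(url):
    """Reduce a URL to a comparable key (assumed rules, adjust as needed)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters but keep real ones such as ?p=123.
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if not k.lower().startswith("utm_")]
    key = host + parts.path.rstrip("/")
    if query:
        key += "?" + urlencode(sorted(query))
    return key


def extract_head_metadata(html, base_url):
    """Return (feed_urls, og_tags) found in the given HTML string."""
    parser = HeadMetadataParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.feed_urls, parser.og
```

With something like this, the entered address and each feed item link can be compared through `normalize_url(...)` instead of as raw strings, and the `og` dictionary serves as the fallback when no item matches.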
I am wondering how Google, for example, does it. When a website doesn’t contain a meta description, Google somehow determines by itself what the content of the page is for its search results.
I am using HtmlAgilityPack to extract elements from pages and my own methods to clean the HTML into text.
Can someone explain the logic or the best approach to this? If I crawl the page directly from the top, I usually end up with content from the sidebar, navigation, etc.
I ended up using Boilerpipe, which is written in Java. I imported it using IKVM, and it works well for pages that are formatted correctly, but it still has trouble with some pages where the content is scattered.
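For anyone who wants to see the kind of logic that avoids sidebar and navigation text, below is a minimal sketch of a text-density heuristic: split the page into blocks, then keep only blocks that have enough words and a low share of words inside links. This is not Boilerpipe's actual algorithm, only a simplified illustration of the general idea. It is in Python (standard library only), and the names `BlockCollector` and `extract_main_text`, the tag lists, and the thresholds `min_words=15` and `max_link_density=0.33` are all assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries and tags whose text is ignored (assumed lists).
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section",
              "h1", "h2", "h3", "blockquote", "pre"}
SKIP_TAGS = {"script", "style", "noscript"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # (text, word_count, link_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any text.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags carry no text; block ones only mark a boundary.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=15, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = [text for text, words, link_words in collector.blocks
            if words >= min_words and link_words / words <= max_link_density]
    return "\n\n".join(kept)
```

Navigation and sidebars tend to be short blocks made almost entirely of links, so they fall below both thresholds. A heuristic like this will still miss content on pages where the article is scattered across many short blocks, which may be the same weakness described above.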