I’m writing a special crawler-like application that needs to retrieve the main content of various pages. Just to clarify: I need the real “meat” of the page (provided there is one, naturally).
I have tried various approaches:
- Many pages have RSS feeds, so I can read the feed and get the page-specific content.
- Many pages use “content” meta tags
- In a lot of cases, the object presented in the middle of the screen is the main “content” of the page
However, these methods don’t always work, and I’ve noticed that Facebook does a mighty fine job of exactly this (when you attach a link, they show you the content they’ve found on the linked page).
So, do you have any tips on an approach I’ve overlooked?
Thanks!
There really is no standard way for web pages to mark “this is the meat”. Most pages don’t even want this because it makes stealing their core business easier. So you really have to write a framework which can use per-page rules to locate the content you want.
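As a rough illustration of what such a per-page rule framework might look like, here is a minimal sketch using only the Python standard library. The hostnames, the rule format (tag, attribute, value), and the names `SITE_RULES`, `RuleExtractor` and `extract_main_content` are all made up for this example; a real crawler would need far more rules and more forgiving HTML handling.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: hostname -> (tag, attribute, value) that
# identifies the element wrapping the main content on that site.
SITE_RULES = {
    "example.com": ("div", "id", "article-body"),
    "blog.example.org": ("div", "class", "post"),
}


class RuleExtractor(HTMLParser):
    """Collects the text found inside the first element matching a rule."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0    # nesting depth of `tag` inside the match; 0 = outside
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth > 0:
            # Same tag nested inside the matched element.
            self.depth += 1
        elif self.value in (dict(attrs).get(self.attr) or "").split():
            # Whitespace split so a rule can match one class among several.
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_main_content(url, html):
    """Return the main text for a page, or None if no rule applies or matches."""
    rule = SITE_RULES.get(urlparse(url).hostname)
    if rule is None:
        return None  # no rule for this site; fall back to feeds, meta tags, etc.
    parser = RuleExtractor(*rule)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks) or None
```

When no rule exists for a site, the function returns None, which is the point where the fallbacks from the question (RSS feed, meta tags) could be tried.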