I'd like to find out how to scrub an HTML page and present it nicely: remove all the clutter and reformat the main text into a very readable format, like http://lab.arc90.com/experiments/readability or Instapaper.
Is it simply a matter of parsing the page and removing the elements that are not within a particular tag?
Was this discussed somewhere else?
https://github.com/jiminoc/goose/wiki does something like what you're asking. The source code is openly available, along with unit tests.
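To give a rough idea of the naive approach the question describes, here is a minimal sketch using only Python's standard library. It assumes the tag in question is `<p>`, and it is not the algorithm used by Readability, Instapaper, or Goose; the names `ParagraphExtractor`, `extract_main_text`, `SKIP_TAGS`, and the `min_chars` threshold are made up for illustration.

```python
from html.parser import HTMLParser

# Containers that usually hold clutter rather than article text (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ParagraphExtractor(HTMLParser):
    """Collects text inside <p> elements, ignoring anything under SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0        # > 0 while inside a SKIP_TAGS element
        self.in_paragraph = False
        self.buffer = []           # text chunks of the paragraph being read
        self.paragraphs = []       # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            # </p> is optional in HTML, so a new <p> ends the previous one.
            self.flush_paragraph()
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.buffer.append(data)

    def flush_paragraph(self):
        # Collapse runs of whitespace and store the paragraph if non-empty.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.paragraphs.append(text)
        self.buffer = []
        self.in_paragraph = False


def extract_main_text(html, min_chars=40):
    """Return <p> texts of at least min_chars characters, outside SKIP_TAGS."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # pick up a trailing unclosed <p>
    return [p for p in parser.paragraphs if len(p) >= min_chars]
```

Calling `extract_main_text(page_source)` returns a list of paragraph strings that you can then re-render with your own stylesheet. On real pages this is too crude on its own (content is often in `<div>`s, and comment sections are full of `<p>` tags), which is presumably why dedicated extraction projects exist.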