Is there any way I can parse a website by just viewing the content as displayed to the user in their browser? That is, instead of downloading “page.html” and parsing the whole page with all its HTML/JavaScript tags, I would retrieve the version as displayed to users in their browsers. I would like to “crawl” websites and rank them according to keyword popularity (viewing the HTML source version is problematic for that purpose).
Thanks!
Joel
A browser also downloads page.html and then renders it. You should work the same way. Use an HTML parser such as lxml.html or BeautifulSoup; with either one you can ask for only the text enclosed within tags (plus any attributes you care about, such as title and alt).
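
To make that concrete, here is a rough sketch of the idea. It uses Python's standard-library html.parser as a dependency-free stand-in for lxml.html or BeautifulSoup, so it is not the API of either library. The names VisibleTextParser and keyword_counts and the sample page are made up for illustration, and the word-splitting rule is just an assumption about what counts as a keyword.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class VisibleTextParser(HTMLParser):
    """Collect text found between tags, plus title/alt attribute values."""

    SKIPPED = {"script", "style"}   # their contents are never shown as text
    KEPT_ATTRS = {"title", "alt"}   # attributes worth counting as text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        for name, value in attrs:
            if name in self.KEPT_ATTRS and value:
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside <script>/<style>.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def keyword_counts(html):
    """Return a Counter of lowercase words in the extracted text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9']+", " ".join(parser.chunks).lower())
    return Counter(words)


# Hypothetical page used only to exercise the sketch.
sample = """<html><head><title>Cats</title><style>p {color: red}</style>
<script>var cats = 1;</script></head>
<body><p>Cats like <b>fish</b>.</p><img src="c.png" alt="a cat"></body></html>"""

counts = keyword_counts(sample)
print(counts.most_common(3))
```

With BeautifulSoup the same extraction is shorter, since its get_text() method returns the text of a parsed document. Neither approach runs JavaScript, so text that a page only inserts via scripts will not show up in the counts.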