I am trying to write a program that reads the articles (posts) of any website, from Blogspot or WordPress blogs to any other site. To write code that is compatible with almost all websites, whether they are written in HTML5, XHTML, or something else, I thought of using RSS/Atom feeds as the basis for extracting content.
However, since RSS/Atom feeds often do not contain the entire article, I planned to gather all the “post” links from the feed using feedparser and then extract the article content from each URL.
I can get the URLs of all the articles on a website (along with the summary, i.e., the article content shown in the feed), but I want the entire article, for which I have to use the respective URL.
I came across various HTML/XML parsing libraries such as BeautifulSoup and lxml, but I don’t know how to get the “exact” content of the article (by “exact” I mean the data with all hyperlinks, iframes, slideshows, etc. still present; I don’t want the CSS part).
So, can anyone help me with this?
Fetching the HTML code of all linked pages is quite easy.
The hard part is extracting exactly the content you are looking for. If you simply need all the code inside the `<body>` tag, this shouldn’t be a big problem either; extracting all text is equally simple. But if you want a more specific subset, you have more work to do.
I suggest that you install the requests and BeautifulSoup modules (both available via `easy_install requests bs4` or, better, `pip install requests bs4`). The requests module makes fetching your page really easy.
The following example fetches an RSS feed and returns three lists:
- `linksoups` is a list of the BeautifulSoup instances of each page linked from the feed
- `linktexts` is a list of the visible text of each page linked from the feed
- `linkimageurls` is a list of lists with the `src` URLs of all the images embedded in each page linked from the feed, e.g. `[['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]`
That might be a rough starting point for your project.
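A minimal sketch of the approach described above, assuming Python 3 with requests, bs4, and feedparser installed. The function names `visible_text`, `image_urls`, and `fetch_feed_pages` are hypothetical, chosen for illustration; only the three returned list names come from the description above.

```python
import requests
from bs4 import BeautifulSoup


def visible_text(soup):
    """Return the text of a page, skipping <script> and <style> contents."""
    # Re-parse a copy so the caller's soup is left untouched.
    clone = BeautifulSoup(str(soup), "html.parser")
    for tag in clone.find_all(["script", "style"]):
        tag.decompose()
    return clone.get_text(separator=" ", strip=True)


def image_urls(soup):
    """Return the src attribute of every <img> tag that has one."""
    return [img["src"] for img in soup.find_all("img", src=True)]


def fetch_feed_pages(feed_url, timeout=10):
    """Fetch every page linked from a feed.

    Returns (linksoups, linktexts, linkimageurls) as described above.
    """
    # Imported here so the two parsing helpers work without feedparser.
    import feedparser

    linksoups, linktexts, linkimageurls = [], [], []
    for entry in feedparser.parse(feed_url).entries:
        link = entry.get("link")
        if not link:
            continue  # some feed entries carry no link
        response = requests.get(link, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        linksoups.append(soup)
        linktexts.append(visible_text(soup))
        linkimageurls.append(image_urls(soup))
    return linksoups, linktexts, linkimageurls
```

Calling `fetch_feed_pages` with the URL of a feed returns the three lists in the same order. The image URLs are returned exactly as they appear in each page, so relative ones would still need to be resolved against the page URL (for example with `urllib.parse.urljoin`). Isolating only the article body, with its links and iframes intact, generally needs a selector specific to each site's markup, which this sketch does not attempt.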