The Archive Base

Editorial Team
Asked: June 4, 2026

I am trying to write a program which reads articles (posts) of any website

I am trying to write a program that reads the articles (posts) of any website, from Blogspot or WordPress blogs to any other site. To keep the code compatible with almost all websites, whether they are written in HTML5, XHTML, etc., I thought of using RSS/Atom feeds as the basis for extracting content.

However, since RSS/Atom feeds often do not contain the entire article, I plan to gather all the “posts” links from the feed using feedparser and then extract the article content from each URL.

I can get the URLs of all articles on a website (including the summary, i.e., the article content shown in the feed), but I want to access the entire article, for which I have to use the respective URL.

I came across various libraries such as BeautifulSoup and lxml (various HTML/XML parsers), but I really don’t know how to get the “exact” content of the article (by “exact” I mean the data with all hyperlinks, iframes, slideshows, etc. still intact; I don’t want the CSS part).

So, can anyone help me with this?

1 Answer

  1. Editorial Team
    Added an answer on June 4, 2026 at 9:28 pm

    Fetching the HTML code of all linked pages is quite easy.

    The hard part is to extract exactly the content you are looking for. If you simply need all code inside of the <body> tag, this shouldn’t be a big problem either; extracting all text is equally simple. But if you want a more specific subset, you have more work to do.

    I suggest installing the requests and BeautifulSoup modules (both are available via easy_install or, better, pip install requests beautifulsoup4). The requests module makes fetching a page really easy.
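For the more specific subset mentioned above, a minimal sketch might look like the following. It assumes the post is wrapped in an `<article>` element, which is common but by no means guaranteed, so the selector would need adjusting per site. `SAMPLE_HTML` and `extract_article` are hypothetical names used only for illustration. The sketch keeps hyperlinks and iframes and drops `<style>` tags and inline `style` attributes.

```python
import bs4

# Hypothetical page source; a real page would come from requests.get(url).text
SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body>
  <div id="sidebar">Related posts</div>
  <article>
    <h1>Title</h1>
    <p style="color: blue">Some <a href="/more">linked</a> text.</p>
    <iframe src="https://example.com/embed"></iframe>
  </article>
</body></html>
"""

def extract_article(html):
    """Return the markup of the first <article> element without CSS, or None."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    article = soup.find('article')  # assumption: the post lives in <article>
    if article is None:
        return None
    # drop embedded stylesheets inside the article
    for style_tag in article.find_all('style'):
        style_tag.decompose()
    # drop inline style attributes, keep every other tag and attribute
    for tag in article.find_all(True):
        tag.attrs.pop('style', None)
    return str(article)

article_html = extract_article(SAMPLE_HTML)
print(article_html)
```

Sites that do not use `<article>` usually need a site-specific selector (for example a `div` with a particular class), so this is a starting point rather than a general solution.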

    The following example fetches an RSS feed and builds three lists:

    • linksoups is a list of the BeautifulSoup instances of each page linked from the feed
    • linktexts is a list of the visible text of each page linked from the feed
    • linkimageurls is a list of lists with the src-urls of all the images embedded in each page linked from the feed
      • e.g. [['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]
    import requests, bs4
    
    # request the feed and parse it as XML (the 'xml' parser needs lxml installed);
    # an HTML parser would treat <link> as an empty tag and lose the URL inside it
    response = requests.get('http://rss.slashdot.org/Slashdot/slashdot')
    responsesoup = bs4.BeautifulSoup(response.content, 'xml')
    
    linksoups = []
    linktexts = []
    linkimageurls = []
    
    # iterate over all <link>…</link> tags and fill three lists: one with the soups of the
    # linked pages, one with all their visible text and one with the urls of all embedded
    # images
    for link in responsesoup.find_all('link'):
        url = link.text.strip()
        if not url:
            continue # skip <link> tags that carry no URL as text
        linkresponse = requests.get(url) # add support for relative urls with urllib.parse.urljoin
        soup = bs4.BeautifulSoup(linkresponse.text, 'html.parser')
        linksoups.append(soup)
    
        # append all text between tags inside of the body tag to the second list
        body = soup.find('body')
        linktexts.append(body.text if body else '')
    
        # get the src attribute of each <img> tag that has one and collect them
        imageurls = [image['src'] for image in soup.find_all('img') if image.has_attr('src')]
        linkimageurls.append(imageurls)
    
    # now somehow merge the retrieved information. 
    

    That might be a rough starting point for your project.
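One possible way to handle the two loose ends in the example (relative URLs and merging the results) is sketched below. It assumes you also keep a list of the page URLs alongside the three lists, which the original loop does not do. `merge_pages` and the sample data are hypothetical. `urljoin` from the standard library resolves relative image paths against the page URL and leaves absolute URLs unchanged.

```python
from urllib.parse import urljoin

def merge_pages(urls, texts, imageurl_lists):
    """Combine parallel lists into one dict per page with absolute image URLs."""
    pages = []
    for url, text, imageurls in zip(urls, texts, imageurl_lists):
        pages.append({
            'url': url,
            'text': text,
            # resolve each src against the URL of the page it was found on
            'images': [urljoin(url, src) for src in imageurls],
        })
    return pages

# Hypothetical sample data in the shape produced by the loop above
urls = ['http://example.com/pageone/', 'http://example.com/pagetwo/post.html']
texts = ['First article', 'Second article']
imageurl_lists = [['/pageone/img1.jpg', 'img2.png'],
                  ['http://cdn.example.com/logo.bmp']]

pages = merge_pages(urls, texts, imageurl_lists)
for page in pages:
    print(page['url'], page['images'])
```

Keeping one record per page tends to be easier to work with than three parallel lists, since each article's text and images stay together.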
