The Archive Base

Editorial Team
Asked: May 26, 2026

I want to crawl and save some webpages as HTML

I want to crawl and save some webpages as HTML. Say, crawl hundreds of popular websites and simply save their front pages and their “About” pages.

I’ve looked through many web crawling and web scraping questions, but didn’t find an answer to this in any of them.

What library or tool should I use to build the solution? Or are there existing tools that can already handle this?


1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 3:08 am

    There really is no good ready-made solution here. As you suspect, Python is probably the best place to start because of its strong support for regular expressions.

    To implement something like this, a solid knowledge of SEO (Search Engine Optimization) helps, since learning how to optimize a webpage for search engines teaches you how search engine crawlers behave. I would start with a site like SEOMoz.

    As for identifying the “about us” page, you have two options:

    a) For each site, get the link to its “about us” page yourself and feed it to your crawler.

    b) Parse all the links on the page and look for certain keywords such as “about us”, “about”, “learn more” or whatever fits.
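
    Option b can be sketched with the Python standard library alone. This is a minimal sketch, not a definitive implementation: the names `ABOUT_KEYWORDS`, `LinkCollector` and `find_about_links` are made up for illustration, and the keyword list is an assumption you would tune for the sites you crawl.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list; adjust it for the sites you crawl.
ABOUT_KEYWORDS = ("about us", "about", "learn more")


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a> tag that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_about_links(html, base_url):
    """Return absolute URLs whose anchor text or href mentions a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    matches = []
    for href, text in parser.links:
        haystack = (text + " " + href).lower()
        if any(keyword in haystack for keyword in ABOUT_KEYWORDS):
            url = urljoin(base_url, href)
            if url not in matches:  # skip repeated header/footer links
                matches.append(url)
    return matches
```

    Plain substring matching is crude (a keyword like “about” also matches unrelated words that contain it), so expect to review the candidates it returns.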

    If you use option b, be careful: you could get stuck in an infinite loop, since a website will link to the same page many times, especially when the link sits in the header or footer. A page may even link back to itself. To avoid this, keep a list of visited links and make sure you never revisit them.
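
    The visited-links idea can be sketched as follows. It is only an illustration under stated assumptions: `crawl` and its `fetch_links` parameter are hypothetical names, and `fetch_links` stands in for whatever code downloads a page and returns the absolute URLs it links to. A set is used instead of a list so that the membership check stays fast.

```python
from collections import deque
from urllib.parse import urldefrag


def crawl(start_url, fetch_links, max_pages=50):
    """Breadth-first crawl that never visits the same URL twice.

    fetch_links is a hypothetical callable: it takes a URL and returns
    the list of absolute URLs linked from that page.
    """
    visited = set()
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        # Drop "#fragment" so /about and /about#team count as one page.
        url, _fragment = urldefrag(queue.popleft())
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if urldefrag(link)[0] not in visited:
                queue.append(link)
    return order
```

    The `max_pages` cap is an extra safety net on top of the visited set, so a site that generates endless distinct URLs still cannot keep the crawler running forever.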

    Finally, I would recommend having your crawler respect the instructions in each site's robots.txt file. It is probably also a good idea not to follow links marked rel="nofollow", as these are mostly used on external links. Again, you can learn this and more by reading up on SEO.
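
    Both checks can be sketched with the standard library. This is a rough sketch under assumptions: `USER_AGENT`, `make_robots_checker` and `FollowableLinks` are illustrative names, and the robots.txt content is passed in as text so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent name


def make_robots_checker(robots_txt):
    """Build a URL checker from the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(USER_AGENT, url)


class FollowableLinks(HTMLParser):
    """Collects hrefs from <a> tags that are not marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel may hold several space-separated values, e.g. "nofollow noopener".
        rel_values = (attrs.get("rel") or "").lower().split()
        if attrs.get("href") and "nofollow" not in rel_values:
            self.hrefs.append(attrs["href"])
```

    In a real crawl you would more likely point `RobotFileParser` at the live file with `set_url()` and `read()` rather than passing the text in by hand.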

    Regards,

© 2021 The Archive Base. All Rights Reserved