Editorial Team
Asked: May 28, 2026


I have a simple web crawler that starts at a root (a given URL), downloads the HTML of the root page, then scans it for hyperlinks and crawls them. I currently store the HTML pages in an SQL database. I am facing two problems:

  1. The crawl seems to hit a bottleneck and can't go any faster. I've read that making multi-threaded HTTP requests for pages can speed up a crawler, but I'm not sure how to do this.

  2. I need an efficient data structure to store the HTML pages and run data mining operations on them. I currently use an SQL database and would like to hear other recommendations.

I am using the .NET Framework, C#, and MS SQL.

1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 5:16 am

    First and foremost, I wouldn't worry about distributed crawling and storage because, as the name suggests, it takes a decent number of machines to get good results. Unless you have a farm of computers, you won't really benefit from it. You can build a crawler that gets 300 pages per second and run it on a single computer with a 150 Mbps connection.

    The next thing on the list is to determine where your bottleneck is.
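
As a rough sanity check on those figures, here is the average page size they imply. This assumes the link is fully saturated and ignores protocol overhead, so treat it as a back-of-the-envelope estimate only:

```latex
\[
\frac{150\ \text{Mbit/s}}{8\ \text{bit/byte}} = 18.75\ \text{MB/s},
\qquad
\frac{18.75\ \text{MB/s}}{300\ \text{pages/s}} = 62.5\ \text{kB/page}
\]
```

In other words, 300 pages per second on a 150 Mbps connection is only reachable if pages average about 62.5 kB or less.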

    Benchmark Your System

    Try to eliminate MS SQL:

    • Load a list of, say, 1000 URLs that you want to crawl.
    • Benchmark how fast you can crawl them.

    If 1000 URLs don't give you a large enough crawl, then get 10,000 or 100,000 URLs (or, if you're feeling brave, the Alexa top 1 million). In any case, try to establish a baseline with as many variables excluded as possible.
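
A minimal sketch of such a baseline measurement, with no database involved. The class and method names (`CrawlBenchmark`, `MeasurePagesPerSecondAsync`) are made up for illustration, and the fetch function is passed in as a delegate so you can time any HTTP client you like:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public static class CrawlBenchmark
{
    // Fetches every URL one after another and returns successful pages per second.
    // The fetch delegate is injected (an assumption of this sketch) so the same
    // harness can time a real HTTP client or a stub.
    public static async Task<double> MeasurePagesPerSecondAsync(
        IReadOnlyList<string> urls, Func<string, Task<string>> fetch)
    {
        var stopwatch = Stopwatch.StartNew();
        int fetched = 0;

        foreach (var url in urls)
        {
            try
            {
                await fetch(url);
                fetched++;
            }
            catch (Exception)
            {
                // Failed downloads are not counted towards the rate.
            }
        }

        stopwatch.Stop();
        // Guard against a zero elapsed time on very small inputs.
        double seconds = Math.Max(stopwatch.Elapsed.TotalSeconds, 1e-9);
        return fetched / seconds;
    }
}
```

With a real client this could be called as `await CrawlBenchmark.MeasurePagesPerSecondAsync(urls, url => client.GetStringAsync(url))`, where `client` is a shared `System.Net.Http.HttpClient` instance. This version is deliberately sequential, so it gives you the single-threaded baseline to compare later changes against.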

    Identify Bottleneck

    After you have your baseline for the crawl speed, try to determine what's causing your slowdown. You will also need to start using multithreading, because you're I/O-bound and have a lot of spare time between page fetches that you can spend extracting links and doing other things, like working with the database.
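
One way to overlap those waits in C# is to run many downloads at once with `async`/`await` and cap the number in flight. This is only a sketch: `ConcurrentFetcher`, `FetchAllAsync`, and the `maxConcurrency` parameter are illustrative names, and the fetch delegate stands in for whatever HTTP client you use:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrentFetcher
{
    // Downloads all URLs with at most maxConcurrency requests in flight.
    // Returns a map from URL to page body; failed URLs are simply left out.
    public static async Task<IDictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        int maxConcurrency)
    {
        var pages = new ConcurrentDictionary<string, string>();

        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                // Wait for a free slot before starting the request.
                await gate.WaitAsync();
                try
                {
                    pages[url] = await fetch(url);
                }
                catch (Exception)
                {
                    // Skip pages that fail to download.
                }
                finally
                {
                    gate.Release();
                }
            }).ToList(); // ToList starts every task immediately.

            await Task.WhenAll(tasks);
        }

        return pages;
    }
}
```

Re-run your benchmark while raising `maxConcurrency` step by step; the point where pages per second stops improving tells you whether the limit is bandwidth, CPU, or the remote servers.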

    How many pages per second are you getting now? You should try and get more than 10 pages per second.

    Improve Speed

    Obviously, the next step is to tweak your crawler as much as possible:

    • Try to speed up your crawler so it hits the hard limits, such as your bandwidth.
    • I would recommend using asynchronous sockets, since they’re MUCH faster than blocking sockets, WebRequest/HttpWebRequest, etc.
    • Use a faster HTML parsing library: start with HtmlAgilityPack and if you’re feeling brave then try the Majestic12 HTML Parser.
    • Use an embedded database rather than an SQL database, and take advantage of key/value storage (hash the URL for the key and store the HTML and other relevant data as the value).
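
The key/value layout from the last bullet can be sketched as follows. SHA-256 is my choice of hash here, not something the approach requires, and `InMemoryPageStore` is a hypothetical stand-in backed by a plain `Dictionary` where a real embedded database would go:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class PageKeys
{
    // Turns a URL into a fixed-length key: the SHA-256 digest as 64 hex characters.
    public static string UrlKey(string url)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}

// Hypothetical stand-in for an embedded key/value store.
public sealed class InMemoryPageStore
{
    private readonly Dictionary<string, string> _pages = new Dictionary<string, string>();

    // Stores (or overwrites) the HTML under the hashed URL.
    public void Put(string url, string html)
    {
        _pages[PageKeys.UrlKey(url)] = html;
    }

    // Looks the page up by URL; returns false if it was never stored.
    public bool TryGet(string url, out string html)
    {
        return _pages.TryGetValue(PageKeys.UrlKey(url), out html);
    }
}
```

Fixed-length keys keep lookups cheap and also give you a quick "have I already crawled this URL?" check. Note that this sketch hashes the URL exactly as given, so URLs that differ only in case or a trailing slash get different keys unless you normalize them first.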

    Go Pro!

    If you've mastered all of the above, then I would suggest you try to go pro! It's important to have a good selection algorithm that mimics PageRank in order to balance freshness and coverage: OPIC (AKA Adaptive Online Page Importance Computation) is pretty much the latest and greatest in that respect. If you have the above tools, then you should be able to implement OPIC and run a fairly fast crawler.

    If you're flexible on the programming language and don't want to stray too far from C#, you can try Java-based enterprise-level crawlers such as Nutch. Nutch integrates with Hadoop and all kinds of other highly scalable solutions.
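
To give a feel for the idea, here is a heavily simplified sketch of OPIC-style scheduling: every page holds some "cash", crawling a page banks its cash into its history and splits it evenly among its out-links, and the crawler always picks the page with the most cash next. The class name, the greedy selection policy, and the handling of dead ends are my simplifications, not the full adaptive algorithm:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class OpicScheduler
{
    private readonly Dictionary<string, double> _cash = new Dictionary<string, double>();
    private readonly Dictionary<string, double> _history = new Dictionary<string, double>();

    // Seeds share one unit of cash equally.
    public OpicScheduler(IEnumerable<string> seedUrls)
    {
        var seeds = seedUrls.Distinct().ToList();
        foreach (var url in seeds)
        {
            _cash[url] = 1.0 / seeds.Count;
            _history[url] = 0.0;
        }
    }

    // Greedy policy: crawl the known page currently holding the most cash.
    public string NextUrl()
    {
        return _cash.OrderByDescending(p => p.Value).First().Key;
    }

    // Call after crawling a known url: bank its cash, then split it among out-links.
    public void RecordCrawl(string url, IReadOnlyCollection<string> outLinks)
    {
        double cash = _cash[url];
        _history[url] += cash;
        _cash[url] = 0.0;

        // Simplification: cash of pages without out-links is dropped.
        if (outLinks.Count == 0) return;

        double share = cash / outLinks.Count;
        foreach (var link in outLinks)
        {
            if (!_cash.ContainsKey(link))
            {
                _cash[link] = 0.0;
                _history[link] = 0.0;
            }
            _cash[link] += share;
        }
    }

    // Importance estimate: this page's share of all cash seen so far.
    public double Importance(string url)
    {
        double total = _history.Values.Sum() + _cash.Values.Sum();
        return total == 0 ? 0 : (_history[url] + _cash[url]) / total;
    }
}
```

In a crawl loop you would call `NextUrl()`, fetch and parse that page, then pass the extracted links to `RecordCrawl`. The linear scan in `NextUrl` is fine for a demo; a large crawl would want a priority queue instead.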

© 2021 The Archive Base. All Rights Reserved