The Archive Base

Editorial Team
Asked: June 4, 2026

I am creating a simple web spider. All it does is accept a URL,


I am creating a simple web spider. All it does is accept a URL, download the HTML and extract the remaining URLs. It then repeats the process for each new URL. I’m also making sure I don’t visit the same URL twice and I am limiting the number of concurrent downloads.

After every unique URL has been exhausted (could run for days, weeks or till after I’m dead and gone), I would like to perform an action, like updating the UI or simply exiting the application.

The problem is, I don’t know how to detect when the last thread has finished running.

Has this threading problem been solved? Am I looking at the problem wrong?

One thought was to keep each thread alive until all of its children have finished (join). The problem is that the number of threads grows exponentially. For such a long-running process, it would quickly exhaust OS resources.

1 Answer

  1. Editorial Team
    Added an answer on June 4, 2026 at 12:10 pm

    I’m not sure what language we are talking about so I’ll speak generically.

    You need a data structure for each URL that keeps track of how many “child” pages generated from it have yet to be spidered. Every URL being spidered has a “parent” data structure. Whenever a new page is found, it is added to the parent’s tree count. Whenever a page is spidered, the parent’s tree count is decremented. These updates need to be synchronized, since multiple threads will be making them.

    You may actually want to save the entire URL structure. Say the root URL “http://foo.x/” has links to “/1.html” and “/2.html”, so its children count is 2. The root URL has a null parent, and “1” and “2” have the root as their parent. When “1.html” is spidered, the root’s children count is decremented to 1. But if there are 3 links inside “1.html”, the root’s count gets incremented to 4. If you want to keep track of the tree, the children count of “1.html” goes to 3, and so on. Then, when one of the children of “1.html” gets spidered, the count for “1.html” goes to 2 and the root URL’s count goes to 3.
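The counting described above can be sketched in Python (the answer itself names no language). This is only an illustration under assumptions: the class name `UrlNode`, the `pending` field, the single shared lock, and the `/1-0.html` style link names are all hypothetical and not part of the original answer.

```python
import threading


class UrlNode:
    """One URL in the crawl tree (hypothetical name, not from the answer)."""

    _lock = threading.Lock()  # one shared lock keeps the sketch simple

    def __init__(self, url, parent=None):
        self.url = url
        self.parent = parent
        self.pending = 0  # pages below this node that are not yet spidered

    def _adjust(self, delta):
        # Apply the change to this node and to every ancestor up to the root.
        node = self
        while node is not None:
            node.pending += delta
            node = node.parent

    def add_child(self, url):
        """Register a newly found link under this node."""
        child = UrlNode(url, parent=self)
        with UrlNode._lock:
            self._adjust(+1)
        return child

    def mark_spidered(self):
        """Record that this page is done: every ancestor loses one pending page."""
        with UrlNode._lock:
            if self.parent is not None:
                self.parent._adjust(-1)


# The worked example from the answer, with made-up names for the 3 links.
root = UrlNode("http://foo.x/")
one = root.add_child("/1.html")
two = root.add_child("/2.html")                            # root: 2
kids = [one.add_child(f"/1-{i}.html") for i in range(3)]   # one: 3, root: 5
one.mark_spidered()                                        # root: 4
kids[0].mark_spidered()                                    # one: 2, root: 3
print(root.pending, one.pending)
```

Note that the sketch registers the three links before marking “1.html” as spidered, so the root passes through 5 rather than 1 on its way to 4. The end state matches the answer, and this order avoids a count briefly reaching 0 while links are still being added.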

    You certainly do not want to keep the threads around and join them later, as you mention; your thread count will explode. You should use a thread pool and submit the URLs to be spidered, each with its associated node in the URL tree, to the pool so that they are all spidered by the same fixed set of threads.

    When a URL has been spidered and its children count goes to 0, you know that you have spidered its whole tree, and the URL can be removed from the working-list and moved to the done-list. When that happens for the root URL, the entire crawl is finished. Again, these lists need to be synchronized, since multiple threads will be operating on them.
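As a rough sketch of how the thread pool and the completion check might fit together, the Python example below swaps the per-node tree for a single outstanding-work counter, which is a simplification of the scheme described above. The `Spider` class, the `fetch_links` callback, and the in-memory `LINKS` graph (standing in for real downloads) are assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link graph used instead of real HTTP downloads.
LINKS = {
    "/": ["/1.html", "/2.html"],
    "/1.html": ["/a.html", "/b.html", "/"],
    "/2.html": ["/a.html"],
    "/a.html": [],
    "/b.html": ["/2.html"],
}


class Spider:
    def __init__(self, fetch_links, max_workers=4):
        self.fetch_links = fetch_links      # callable: url -> iterable of links
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.lock = threading.Lock()
        self.seen = set()                   # URLs already queued or spidered
        self.outstanding = 0                # URLs queued or in progress
        self.done = threading.Event()

    def submit(self, url):
        with self.lock:
            if url in self.seen:            # never visit the same URL twice
                return
            self.seen.add(url)
            self.outstanding += 1
        self.pool.submit(self._spider, url)

    def _spider(self, url):
        try:
            for link in self.fetch_links(url):
                self.submit(link)           # children are counted first...
        finally:
            with self.lock:
                self.outstanding -= 1       # ...then this page is released
                finished = self.outstanding == 0
            if finished:
                self.done.set()             # last page just finished

    def run(self, root):
        self.submit(root)
        self.done.wait()                    # blocks until the crawl is exhausted
        self.pool.shutdown()
        return self.seen


crawled = Spider(lambda url: LINKS.get(url, [])).run("/")
print(sorted(crawled))
```

Because each page submits its links before decrementing the counter, the counter can only reach 0 once nothing is queued or running. The code after `done.wait()` is where the UI update or application exit would go, and `max_workers` caps the number of concurrent downloads.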

    Hope this helps somewhat.


© 2021 The Archive Base. All Rights Reserved