I am writing a web crawler that processes multiple URLs at the same time and works in the following way:
- It gets a URL from the list of URLs in seed_list.txt,
- It crawls it and writes the data into data.txt,

just like most web crawlers work.
When I make it single-threaded, the data in data.txt comes out in the same order as the URLs in seed_list.txt. When it's multi-threaded, I can't seem to control the order, because each thread writes its data to data.txt as soon as it finishes.
Is there a way I can make my web crawler multi-threaded but keep the original order?
Thank you very much!
@Lance, Ignacio, and Maksym,
thank you all for your help – your answers definitely point me in the right direction.
You could create a class that holds the line number from seed_list.txt, the URL, and a field for the data fetched from the web. An object of this type is created with the line number and URL, then passed to a worker thread, which stores the fetched data in the object. The object is then passed to a writer thread, which orders the objects by line number and writes out the data as each next-in-order object becomes available.
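
Here is a minimal sketch of that idea in Python (the question doesn't say which language is in use, so treat the language as an assumption). The names `Job`, `worker`, `writer`, `crawl_in_order`, and `fake_fetch` are all made up for illustration, and `fake_fetch` merely stands in for whatever actually downloads a page. The writer keeps early arrivals in a dictionary and only writes a result once every earlier line number has been written.

```python
import io
import queue
import random
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, TextIO


@dataclass
class Job:
    index: int                  # line number of the URL in seed_list.txt
    url: str
    data: Optional[str] = None  # filled in by a worker thread


def worker(jobs: queue.Queue, done: queue.Queue,
           fetch: Callable[[str], str]) -> None:
    while True:
        job = jobs.get()
        if job is None:         # sentinel: no more work for this worker
            break
        try:
            job.data = fetch(job.url)
        except Exception as exc:
            # Record the failure so the writer is never left waiting
            # for an index that will not arrive.
            job.data = f"ERROR {job.url}: {exc}"
        done.put(job)


def writer(done: queue.Queue, total: int, out: TextIO) -> None:
    pending = {}                # finished jobs that arrived early, by index
    next_index = 0              # the next line number allowed to be written
    while next_index < total:
        job = done.get()
        pending[job.index] = job
        # Flush every result that is now contiguous with what was written.
        while next_index in pending:
            out.write(pending.pop(next_index).data + "\n")
            next_index += 1


def crawl_in_order(urls: Sequence[str], fetch: Callable[[str], str],
                   out: TextIO, num_workers: int = 4) -> None:
    jobs: queue.Queue = queue.Queue()
    done: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, done, fetch))
               for _ in range(num_workers)]
    threads.append(threading.Thread(target=writer,
                                    args=(done, len(urls), out)))
    for t in threads:
        t.start()
    for i, url in enumerate(urls):
        jobs.put(Job(i, url))
    for _ in range(num_workers):
        jobs.put(None)          # one sentinel per worker
    for t in threads:
        t.join()


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real download; sleeps a random time
    # so that jobs finish out of order.
    time.sleep(random.uniform(0, 0.01))
    return "data for " + url
```

To use it with the files from the question, you would read the URLs from seed_list.txt into a list, open data.txt for writing, and pass both to `crawl_in_order` along with a real fetch function. Since only the writer thread touches the output file, no lock around the file is needed.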