I’m making a simple Wikipedia page crawler that writes its results to a remote server running Redis. The crawler works as follows:
1. The crawler asks the server for a page that needs crawling.
2. The crawler loads the page and adds the pages it finds to an internal buffer.
3. When the page has finished being parsed, the results are sent to the server.
How do I keep all pages found on the server, each with a flag that states whether the page has been crawled or not?
For example:
- 1 http://en.wikipedia.org/wiki/MeBeam
- 0 http://en.wikipedia.org/wiki/Chemistry
- 1 http://en.wikipedia.org/wiki/Australia
My question is:
How can I ask Redis to give me the first link it has with a state of 0 (not crawled yet),
and then how can I tell Redis to change that state to 1 (after I have crawled it)?
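You can use a list to hold the pages that still need to be processed.
Then you can use lpop to get the first item in the list.
To keep track of processed pages, you can use a set: add each page to it with sadd once it has been crawled.
Finally, check whether an address is already in the processed set with sismember before crawling it or queueing it again.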
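
A minimal sketch of this list-plus-set approach in Python might look like the following. The key names `crawler:todo` and `crawler:done` and the function names `next_page` and `mark_crawled` are made up for illustration. `r` is assumed to be a client object exposing `lpop`, `rpush`, `sadd` and `sismember`, as a redis-py `redis.Redis` instance does.

```python
# Hypothetical key names, chosen only for this example.
TODO_KEY = "crawler:todo"   # list of URLs waiting to be crawled
DONE_KEY = "crawler:done"   # set of URLs that have already been crawled


def next_page(r):
    """Return the next uncrawled URL, or None if the queue is empty."""
    while True:
        url = r.lpop(TODO_KEY)          # take the first item off the list
        if url is None:
            return None                 # nothing left to crawl
        if not r.sismember(DONE_KEY, url):
            return url                  # skip URLs that were crawled meanwhile


def mark_crawled(r, url, found_urls):
    """Record url as crawled and queue the links found on it."""
    r.sadd(DONE_KEY, url)
    for found in found_urls:
        # Only queue pages that are not already in the processed set.
        if not r.sismember(DONE_KEY, found):
            r.rpush(TODO_KEY, found)
```

With redis-py, `r` could be created as `redis.Redis(host=..., decode_responses=True)` so that `lpop` returns strings rather than bytes. Note that this sketch does not stop the same URL from being queued twice before it is crawled, so with several crawlers running at once a page may occasionally be fetched more than once.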