I need to scrape a website that has a basic folder system, with folders labled with keywords – some of the folders contain text files. I need to scan all the pages (folders) and check the links to new folders, record keywords and files. My main problem ise more abstract: if there is a directory with nested folders and unknown “depth”, what is the most pythonc way to iterate through all of them. [if the “depth” would be known, it would be a really simple for loop). Ideas greatly appriciated.
Share
Here’s a simple spider algorithm. It uses a deque for documents to be processed and a set of already processed documents:
Note that this is iterative and as such can work with arbitrary deep trees.