I have written a spider that crawls through a folder named fid and extracts the names of all its sub-folders as links. The problem is that each of these sub-folders has an HTML page inside it, and I want to extract the names of all these HTML files and add them to the current “start_urls”, so that I can scrape the required information from all of these pages. I have tried:
os.listdir()
glob.glob()
but none of these worked. Please help me with this.
One stdlib approach is to use os.walk in combination with fnmatch:
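A minimal sketch of that approach follows. It assumes the HTML files end in .html and that the spider can fetch local file:// URLs. The helper names find_html_files and to_file_urls are made up for illustration and are not part of any library.

```python
import fnmatch
import os
from pathlib import Path


def find_html_files(root):
    """Return the sorted paths of every *.html file under root (hypothetical helper)."""
    matches = []
    # os.walk visits root and every sub-folder below it, at any depth.
    for dirpath, _dirnames, filenames in os.walk(root):
        # fnmatch.filter keeps only the names matching the shell-style pattern.
        for name in fnmatch.filter(filenames, "*.html"):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def to_file_urls(paths):
    """Convert filesystem paths to absolute file:// URLs (hypothetical helper)."""
    return [Path(p).resolve().as_uri() for p in paths]
```

In the spider, the result could then be assigned with something like start_urls = to_file_urls(find_html_files("fid")), assuming fid is resolved relative to the directory the crawl is started from. If the folder does not exist, os.walk yields nothing and the list is simply empty, so it is worth printing the result once to confirm the path is correct.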