The Archive Base


Editorial Team
Asked: May 28, 2026

I have written a spider that crawls through a folder named fid and extracts

I have written a spider that crawls through a folder named fid and extracts the names of all the sub-folders as links. The problem is that each of these sub-folders contains an HTML page, and I want to extract the names of all these HTML files and add them to the current “start_urls” so that I can scrape the required information from all of these pages. I have tried:

os.listdir()
glob.glob()

but neither of these worked. Please help me with this.


1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 8:20 am

    One standard-library approach is to use os.walk in combination with fnmatch:

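    import fnmatch
    import os
    
    start_urls = []
    
    # '/start/dir/' is a placeholder; point it at the top-level folder to crawl.
    # os.walk visits that folder and every sub-folder beneath it.
    for root, dirnames, filenames in os.walk('/start/dir/'):
        # Keep only the file names that end in .html.
        for filename in fnmatch.filter(filenames, '*.html'):
            # Store the full path to each matching file.
            start_urls.append(os.path.join(root, filename))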
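
The loop above collects plain file system paths. Scrapy generally expects every entry in start_urls to carry a scheme, so local files usually need to be given as file:// URLs. The question does not say why os.listdir() and glob.glob() failed; one likely cause is that neither searches sub-folders by default. The following is a minimal sketch using pathlib instead, under those assumptions. The helper name html_file_urls is hypothetical and not part of the original answer.

```python
from pathlib import Path


def html_file_urls(base_dir):
    """Return sorted file:// URLs for every .html file under base_dir.

    html_file_urls is a hypothetical helper name used for illustration.
    """
    # as_uri() only works on absolute paths, so resolve the base first.
    base = Path(base_dir).resolve()
    # rglob searches base and all of its sub-folders recursively.
    return sorted(
        path.as_uri() for path in base.rglob("*.html") if path.is_file()
    )
```

In a spider this might be used as start_urls = html_file_urls("fid"), assuming the folder named fid from the question sits in the working directory. glob.glob("fid/**/*.html", recursive=True) would find the same files, but it returns paths rather than URLs.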
    



© 2021 The Archive Base. All Rights Reserved