I’m trying to save files to a directory after scraping them from the web

Asked: June 1, 2026 (2026-06-01T21:37:59+00:00)

I’m trying to save files to a directory after scraping them from the web with Scrapy. I extract a date from each file and use it as the file name. The problem is that some files share a date, so two files would both be named “June 2, 2009”. I’d like to check whether a file with that name already exists and, if so, save the new one under a name like “June 2, 2009.1”.

The code I’m using is as follows:

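# Module-level imports the method relies on (legacy Scrapy selector API).
import codecs
import re

from scrapy.selector import HtmlXPathSelector

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)

    # Turn <br /> tags into newlines before parsing.
    response = response.replace(body=response.body.replace('<br />', '\n'))

    hxs = HtmlXPathSelector(response)

    # Pull a date such as "June 2, 2009" out of the content div.
    date = hxs.select("//div[@id='content']").extract()[0]
    dateStrip = re.search(r"(?:[A-Z]+|[A-Z][a-z]+)\s\d{1,2},\s[0-9]+", date)
    newDate = dateStrip.group()

    # Flatten the content div to plain text.
    content = hxs.select("//div[@id='content']")
    content = content.select('string()').extract()[0]

    filename = "/path/to/a/folder/ %s.txt" % newDate

    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(content)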

1 Answer

Editorial Team · Answer 1 · 2026-06-01T21:38:00+00:00

The other answer pointed me in the right direction, toward Python’s os tools, but I think the approach I found is more straightforward. See “How do I check whether a file exists using Python?” for more.

The following is the code I came up with:

    # If the plain name is already taken, fall back to a ".1" suffix.
    if os.path.isfile(filename):
        filename = "/path/.../.../- %s.1.txt" % newDate

    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(content)
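
One caveat with checking first and writing afterwards is that something else could create the file between the check and the write. As a minimal sketch, assuming Python 3, the built-in open() accepts mode "x", which raises FileExistsError instead of overwriting, so the check and the creation happen in one step. The helper name write_if_new and the sample file name are hypothetical, not part of the original code.

```python
import os
import tempfile


def write_if_new(path, text):
    """Create `path` and write `text`; return False if it already exists.

    Hypothetical helper: mode "x" (Python 3) makes open() raise
    FileExistsError rather than overwrite an existing file.
    """
    try:
        with open(path, "x", encoding="utf-8") as output:
            output.write(text)
    except FileExistsError:
        return False
    return True


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so no real files are touched.
    folder = tempfile.mkdtemp()
    target = os.path.join(folder, "June 2, 2009.txt")
    print(write_if_new(target, "first"))
    print(write_if_new(target, "second"))
```

A False return is the signal to try a different name, which is the case the original else branch handles.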

Edited to Add:

I wasn’t happy with my first solution. The other answer’s approach was probably better, but it didn’t quite work for me. My main objection to my own version was that it only handled two files with the same name; with three or four, the original problem would come back. The following is what I came up with:

    filename = "/Users/path/ title %s -1.txt" % date

    # The trailing "-1" parses as the integer -1, so the first name tried ends in " 0.txt".
    while True:
        # Drop the extension, bump the trailing counter, and rebuild the name.
        newName = filename.replace(".txt", "").split()
        newName[-1] = str(int(newName[-1]) + 1)
        filename = " ".join(newName) + ".txt"
        # Write to the first candidate name that is not already taken.
        if not os.path.isfile(filename):
            with codecs.open(filename, 'w', encoding='utf-8') as output:
                output.write(texts)
            break

It probably isn’t the most elegant and might be kind of a hackish approach, but it has worked so far and seems to have addressed my problem.
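
For comparison, the same idea can be sketched without parsing the counter back out of the file name. This assumes Python 3 and uses the “.1”, “.2” suffix style suggested in the question; the helper names next_free_path and save_unique are hypothetical.

```python
import os
import tempfile


def next_free_path(folder, stem, ext=".txt"):
    """Return the first unused path among stem.ext, stem.1.ext, stem.2.ext, ...

    Hypothetical helper; the ".N" suffix follows the naming suggested
    in the question ("June 2, 2009.1").
    """
    candidate = os.path.join(folder, stem + ext)
    counter = 0
    while os.path.exists(candidate):
        counter += 1
        candidate = os.path.join(folder, "%s.%d%s" % (stem, counter, ext))
    return candidate


def save_unique(folder, stem, text):
    """Write `text` under the next free name and return the path used."""
    path = next_free_path(folder, stem)
    with open(path, "w", encoding="utf-8") as output:
        output.write(text)
    return path


if __name__ == "__main__":
    # Save three items that share a date into a throwaway directory.
    folder = tempfile.mkdtemp()
    for body in ("a", "b", "c"):
        print(os.path.basename(save_unique(folder, "June 2, 2009", body)))
```

Keeping the counter in a variable means the file name can contain any spaces or digits without confusing the numbering.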
