I’m trying to scrape pages using a Scrapy spider and then save those pages

Asked: May 27, 2026 (2026-05-27T14:17:37+00:00)

I’m trying to scrape pages using a Scrapy spider and then save those pages into a .txt file in a readable form. The code I’m using to do this is:

def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url) 

        hxs = HtmlXPathSelector(response)

        title = hxs.select('/html/head/title/text()').extract() 
        content = hxs.select('//*[@id="content"]').extract() 

        texts = "%s\n\n%s" % (title, content) 

        soup = BeautifulSoup(''.join(texts)) 

        strip = ''.join(BeautifulSoup(pretty).findAll(text=True)) 

        filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title) 
        filly = open(filename, "w")
        filly.write(strip)

I’ve combined BeautifulSoup here because the body text contains a lot of HTML that I don’t want in the final product (primarily links), so I use BS to strip out the HTML and leave only the text that is of interest.

This gives me output that looks like:

[u"School, Chandler's Ford (Hansard, 30 November 1961)"]

[u'

 \n      \n

  HC Deb 30 November 1961 vol 650 cc608-9

 \n

  608

 \n

  \n


  \n

   \n

    \xa7

   \n

    28.

   \n


     Dr. King


   \n

    \n            asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\'s Ford; and why he refused permission to acquire this site in 1954.\n

   \n

  \n

 \n      \n

  \n


  \n

   \n

    \xa7

   \n


     Sir D. Eccles


   \n

    \n            I understand that the authority has paid \xa375,000 for this site.\n            \n

While I want the output to look like:

    School, Chandler's Ford (Hansard, 30 November 1961)

          HC Deb 30 November 1961 vol 650 cc608-9

          608

            28.

Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler's Ford; and why he refused permission to acquire this site in 1954.

Sir D. Eccles I understand that the authority has paid £375,000 for this site.

So I’m basically looking for how to remove the newline indicators \n, tighten everything up, and convert any special characters to their normal format.

1 Answer

Editorial Team · Answer 1 · 2026-05-27T14:17:38+00:00

My answer is in the code comments:

import re
import codecs

#...
#...
# extract() returns a list, so take the first element
title = hxs.select('/html/head/title/text()').extract()[0]
content = hxs.select('//*[@id="content"]')
# instead of using BeautifulSoup for this task, you can use the following
content = content.select('string()').extract()[0]

# collapse runs of repeated spaces and newlines; you may need to adjust this expression
cleaned_content = re.sub(ur'(\s)\s+', ur'\1', content, flags=re.MULTILINE | re.UNICODE)

texts = "%s\n\n%s" % (title, cleaned_content)

# looks like there is a typo in the filename creation
#filename ....

# my preferred way to write a file with an explicit encoding
with codecs.open(filename, 'w', encoding='utf-8') as output:
    output.write(texts)
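
The cleanup steps above can be tried without Scrapy. The following is only a minimal sketch, written for Python 3 (the snippet above uses Python 2 `ur''` literals), and it assumes the title and content have already been extracted as lists of strings. The helper names `collapse_whitespace`, `build_page_text`, and `save_text` are hypothetical and are not part of Scrapy or of the original code. It also shows why the original output contained `[u'...']` and literal `\n`: formatting a list with `%s` writes the list's repr, not its text.

```python
import io
import re


def collapse_whitespace(text):
    """Collapse each run of whitespace to its first character and trim the ends."""
    # For str patterns in Python 3, \s also matches Unicode whitespace
    # such as the non-breaking space (\xa0).
    return re.sub(r"(\s)\s+", r"\1", text).strip()


def build_page_text(title_parts, content_parts):
    """Join extracted fragments into one readable block of text."""
    # extract() returns a list; formatting the list itself with %s would
    # write its repr (brackets, quotes, escaped \n) instead of the text,
    # so take the first element here (hypothetical fallback: empty string).
    title = title_parts[0] if title_parts else ""
    content = content_parts[0] if content_parts else ""
    return "%s\n\n%s" % (collapse_whitespace(title), collapse_whitespace(content))


def save_text(path, text):
    """Write text as UTF-8 so characters such as the pound and section signs survive."""
    with io.open(path, "w", encoding="utf-8") as output:
        output.write(text)
```

A call such as `save_text(filename, build_page_text(title_list, content_list))` would replace the last few lines of `parse_item`, where `filename`, `title_list`, and `content_list` stand for whatever the spider already has. Keeping the first whitespace character of each run (rather than always substituting a space) preserves line breaks between speakers, which is usually what you want for transcripts; adjust the pattern if you prefer everything on one line.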
