During a parsing process using scrapy I have found this output [u'TARTARINI AUTO SPA

Asked: June 3, 2026 (2026-06-03T00:01:18+00:00)

While parsing a page with Scrapy, I got this output:

[u'TARTARINI AUTO SPA (CENTRALINO SELEZIONE PASSANTE)'],"[u'V. C.BONAZZI\xa043', u'40013', u'CASTEL MAGGIORE']",[u'0516322411'],[u'info@tartariniauto.it'],[u'CARS (LPG INSTALLERS)'],[u'track.aspx?id=0&url=http://www.tartariniauto.it']

As you can see, there are some extra characters, such as:

u' \xa043 " ' [ ]

I don't want these. How can I remove them?
Also, there are 5 items in this string. I want the string to look like this:

item1 , item2 , item3 , item4 , item5

Here is my pipelines.py code:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
import re
import json
import csv

class InfobelPipeline(object):
    def __init__(self):
        # Python 2: the csv module expects the file opened in binary mode
        self.file = csv.writer(open('items.csv','wb'))

    def process_item(self, item, spider):
        name = item['name']
        address = item['address']
        phone = item['phone']
        email = item['email']
        category = item['category']
        website = item['website']
        self.file.writerow((name,address,phone,email,category,website))
        # return the item so that later pipelines still receive it
        return item

Thanks

1 Answer

Editorial Team · Answer 1 · 2026-06-03T00:01:19+00:00

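The extra characters you're seeing come from unicode strings: the u'...' prefix is how Python 2 displays a unicode string, and \xa0 is a non-ASCII character (U+00A0, a non-breaking space). You'll see such characters a lot when scraping the web. Other common examples include the copyright symbol © (code point U+00A9) and the trademark symbol ™ (code point U+2122).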

The quickest way to remove them is to encode the string to ASCII and discard any characters that cannot be represented in ASCII (none of the symbols above can):

>>> example = u"Xerox ™ printer"
>>> example
u'Xerox \u2122 printer'
>>> example.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 6: ordinal not in range(128)
>>> example.encode('ascii', errors='ignore')
'Xerox  printer'
>>>

As you can see, when you try to encode the symbol to ASCII it raises a UnicodeEncodeError because the character can't be represented in ASCII. However, if you add the errors='ignore' keyword argument, the encoder simply skips any character it can't encode.
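Applied to the data in the question, the same idea could look roughly like the sketch below. It assumes that each item field is a list of unicode strings, which is where the [ ] and u'...' in the output come from. The helper names clean_text and clean_field are hypothetical and are not part of Scrapy. Replacing \xa0 with a plain space before encoding is an extra step added here; without it, "BONAZZI\xa043" would collapse to "BONAZZI43".

```python
# -*- coding: utf-8 -*-
# Minimal sketch; clean_text and clean_field are hypothetical helper names.


def clean_text(value):
    """Turn non-breaking spaces into spaces, drop other non-ASCII, trim."""
    text = value.replace(u"\xa0", u" ")
    # errors="ignore" silently drops every character ASCII cannot represent
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # collapse runs of whitespace left behind by the dropped characters
    return u" ".join(text.split())


def clean_field(values):
    """Join a list of extracted strings into one cleaned string."""
    return u" ".join(clean_text(v) for v in values)


address = [u"V. C.BONAZZI\xa043", u"40013", u"CASTEL MAGGIORE"]
trademark = [u"Xerox \u2122 printer"]

cleaned_address = clean_field(address)
cleaned_trademark = clean_field(trademark)

print(cleaned_address)    # V. C.BONAZZI 43 40013 CASTEL MAGGIORE
print(cleaned_trademark)  # Xerox printer
```

In the pipeline, each field would be passed through a helper like clean_field before being handed to writerow, so the CSV row holds plain strings rather than the printed form of a list.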
