While parsing a site with Scrapy I got this output:
[u'TARTARINI AUTO SPA (CENTRALINO SELEZIONE PASSANTE)'],"[u'V. C.BONAZZI\xa043', u'40013', u'CASTEL MAGGIORE']",[u'0516322411'],[u'info@tartariniauto.it'],[u'CARS (LPG INSTALLERS)'],[u'track.aspx?id=0&url=http://www.tartariniauto.it']
As you can see, there are some extra characters like
u' \xa043 " ' [ ]
which I don't want.
How can I remove these?
Besides, there are 5 items in this string. I want the string to look like this:
item1, item2, item3, item4, item5
Here is my pipelines.py code
    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
    import re
    import json
    import csv

    class InfobelPipeline(object):
        def __init__(self):
            self.file = csv.writer(open('items.csv','wb'))

        def process_item(self, item, spider):
            name = item['name']
            address = item['address']
            phone = item['phone']
            email = item['email']
            category = item['category']
            website = item['website']
            self.file.writerow((name,address,phone,email,category,website))
            return item
Thanks
The extra characters you're seeing come from unicode strings: the u'...' prefix is how Python 2 displays a unicode string, and \xa0 is a non-ASCII character (a non-breaking space, U+00A0). You'll see these a lot if you're scraping the web. Other common examples are the copyright symbol © (code point U+00A9) and the trademark symbol ™ (code point U+2122). The quickest way to remove them is to encode the string to ascii and throw away any characters that can't be represented in ascii.
When you try to encode such a symbol to ascii, Python raises a UnicodeEncodeError because the character can't be represented in ascii. However, if you add the errors='ignore' keyword argument, it will simply drop the symbols it can't encode.
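Here is a minimal sketch of that approach, using the address field from the question. The helper name clean_field is hypothetical (it is not part of Scrapy), and replacing \xa0 with a normal space before encoding is my own assumption so that "BONAZZI" and "43" don't get glued together. The trailing .decode('ascii') is only there so the result is text on Python 3 as well as Python 2.

```python
from __future__ import print_function

# One field exactly as the spider returned it: a list of unicode strings.
address = [u'V. C.BONAZZI\xa043', u'40013', u'CASTEL MAGGIORE']

# With the default strict error handling, a non-ascii character
# makes the encode call fail.
try:
    address[0].encode('ascii')
except UnicodeEncodeError as exc:
    print('strict encoding failed:', exc.reason)

# errors='ignore' silently drops whatever cannot be encoded.
print(address[0].encode('ascii', errors='ignore').decode('ascii'))
# V. C.BONAZZI43


def clean_field(values):
    """Join a list of unicode strings into one plain ascii string.

    Hypothetical helper for illustration, not a Scrapy API.
    """
    # Assumption: turn the non-breaking space into a normal space
    # so the words on either side of it stay separated.
    joined = u' '.join(values).replace(u'\xa0', u' ')
    return joined.encode('ascii', errors='ignore').decode('ascii')


print(clean_field(address))
# V. C.BONAZZI 43 40013 CASTEL MAGGIORE
```

The brackets and quotes in your CSV appear because each item field is a list, and the list itself is what gets written. If you pass every field through something like clean_field(item['name']) in process_item before calling writerow, each column should come out as a plain string.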