I have this code:
site = hxs.select("//h1[@class='state']")
mydata = site.select("string()").extract()
cleaned_mydata = re.sub(ur'(\s)\s+', ur'\1', mydata[0], flags=re.MULTILINE + re.UNICODE)
log.msg(str(mydata),level=log.ERROR)
log.msg(str(cleaned_mydata),level=log.ERROR)
The first output is:
ERROR: [u'\r\n 212\r\n jobs containing php in xxxx\r\n ']
The second output is:
jobs containing php in xxxxxx
The regex also seems to strip the number 212. How can I fix that?
The problem is that this regex keeps the first whitespace character of each run and strips only the ones that follow it.
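To see what the pattern actually leaves behind, here is a minimal sketch. It is written for Python 3, so the ur'' prefixes from the question become plain r''. The sample string and its exact spacing are assumptions modeled on the logged output. Printing the repr shows the control characters instead of letting the terminal act on them:

```python
import re

# Sample string modeled on the first log line in the question (exact spacing is assumed).
raw = u"\r\n 212\r\n jobs containing php in xxxx\r\n "

# Same pattern as in the question: keep the first whitespace character of a run
# and drop the rest of that run.
cleaned = re.sub(r"(\s)\s+", r"\1", raw, flags=re.MULTILINE | re.UNICODE)

# repr() makes the surviving \r characters visible.
print(repr(cleaned))
```

The 212 is still in the string; only a bare \r is left on either side of it.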
This means that
\r\n 212\r\n jobs...
becomes
\r212\rjobs...
When you print this, the 212 will be printed, then a carriage return will return the cursor to the first column, so that the following jobs... will overwrite the 212.
This raises two questions:
\r\n would have been normalized into \ns) – why?
Edit:
So, according to your comment, you want to
Then use
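As a sketch of one possible fix, assuming the goal is to trim the ends and collapse every run of whitespace (including \r and \n) into a single space. The helper name collapse_whitespace is hypothetical, and the sample string is modeled on the logged output:

```python
import re

def collapse_whitespace(text):
    """Hypothetical helper: trim both ends, then turn each whitespace run into one space."""
    return re.sub(r"\s+", " ", text.strip(), flags=re.UNICODE)

# Sample string modeled on the first log line in the question (exact spacing is assumed).
raw = u"\r\n 212\r\n jobs containing php in xxxx\r\n "

print(collapse_whitespace(raw))
```

Because no carriage return survives, nothing is overwritten when the result is printed or logged.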