I have this code site = hxs.select("//h1[@class='state']") mydata = site.select("string()").extract() cleaned_mydata = re.sub(ur'(\s)\s+', ur'\1',

Question

Asked: June 15, 2026


I have this code

site = hxs.select("//h1[@class='state']")
mydata = site.select("string()").extract()
cleaned_mydata = re.sub(ur'(\s)\s+', ur'\1', mydata[0], flags=re.MULTILINE + re.UNICODE)

log.msg(str(mydata), level=log.ERROR)
log.msg(str(cleaned_mydata), level=log.ERROR)

The first output is

ERROR: [u'\r\n 212\r\n jobs containing php in xxxx \r\n ']

The second output is

jobs containing php in xxxxxx

The regex also appears to strip the number 212. How can I fix that?


1 Answer

Editorial Team · Answer 1 · 2026-06-15T03:07:59+00:00

The problem is that this regex keeps the first whitespace character of each run of whitespace and strips only the characters that follow it.

This means that

u'\r\n 212\r\n jobs containing php in xxxx \r\n '

becomes

u'\r212\rjobs containing php in xxxx '

When you print this, the 212 will be printed, then a carriage return will return the cursor to the first column, so that the following jobs... will overwrite the 212.
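The effect can be reproduced outside the scraper. The following is a minimal sketch, assuming Python 3 (where the ur'' prefix from the question is not available, so plain raw strings are used) and a sample string copied from the first log line; the variable names raw and collapsed are only illustrative.

```python
import re

# Sample string copied from the first log line in the question.
raw = '\r\n 212\r\n jobs containing php in xxxx \r\n '

# (\s)\s+ matches a run of two or more whitespace characters and
# replaces it with only the first character of that run (group 1).
collapsed = re.sub(r'(\s)\s+', r'\1', raw)

# repr() makes the surviving carriage returns visible instead of
# letting the terminal act on them.
print(repr(collapsed))
```

Logging or printing repr(collapsed) rather than the string itself shows that the 212 is still there, followed by a carriage return.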

This raises two questions:

You appear to be reading a text file in binary mode (otherwise the \r\n would have been normalized into \ns) – why?
Do you really want the regex to work this way?

Edit:

So, according to your comment, you want to

strip leading and trailing whitespace completely
condense multiple consecutive whitespace characters into a single space (ASCII 32).

Then use

cleaned_mydata = re.sub(r'\s+', ' ', mydata[0].strip())
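As a quick check of that line, here is a small sketch, again assuming Python 3 and the sample string from the question. The helper name clean_whitespace is hypothetical and only wraps the expression above.

```python
import re

def clean_whitespace(text):
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    return re.sub(r'\s+', ' ', text.strip())

# Sample string copied from the first log line in the question.
raw = '\r\n 212\r\n jobs containing php in xxxx \r\n '

cleaned = clean_whitespace(raw)
print(repr(cleaned))
```

For this input, ' '.join(raw.split()) gives the same result without a regex, since str.split() with no arguments splits on runs of whitespace and ignores leading and trailing whitespace.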
