I am trying to parse the number of results from the HTML code returned

Question

Asked: June 15, 2026 (2026-06-15T09:41:15+00:00)

I am trying to parse the number of results from the HTML code returned from a search query. However, when I use find()/index(), it seems to return the wrong position. The string I am searching for contains an accented character, so I search for it as a Unicode string.

A snippet of the HTML code being parsed:

<div id="WPaging_total">
  Aproximádamente 37 resultados.
</div>

and I search for it like this:

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)#len('Aproxim\xe1damente ')==16
print html[str_start+16:str_end] #works by changing 16 to 24

The print statement returns:

damente 37

When the expected result is:

37

It seems str_start does not point at the beginning of the string I am searching for, but 8 positions before it.

print html[str_start:str_start+5]

Outputs:

l">

The problem is hard to replicate, though, because it doesn’t happen when using the code snippet above, only when searching inside the entire HTML string. I could simply change str_start+16 to str_start+24 to get it working as intended, but that doesn’t help me understand the problem. Is it a Unicode issue? Hopefully someone can shed some light on this.

Thank you.

LINK:
http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1

SAMPLE CODE:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read()

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]

1 Answer

Editorial Team · Answer 1 · 2026-06-15T09:41:16+00:00

Your problem ultimately boils down to the fact that in Python 2.x, the str type represents a sequence of bytes while the unicode type represents a sequence of characters. Because one character can be encoded by multiple bytes, that means that the length of a unicode-type representation of a string may differ from the length of a str-type representation of the same string, and, in the same way, an index on a unicode representation of the string may point to a different part of the text than the same index on the str representation.
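To make the length difference concrete, here is a minimal sketch. It is written in Python 3 syntax so it runs on a current interpreter; there, bytes plays the role that str plays in Python 2.x, and str plays the role of unicode. The sample string is the search string from the question.

```python
# Python 3 sketch: bytes here corresponds to Python 2's str,
# and str here corresponds to Python 2's unicode.
text = u'Aproxim\xe1damente '     # the search string as characters
raw = text.encode('utf-8')        # the same text as UTF-8 bytes

char_len = len(text)              # counts characters
byte_len = len(raw)               # counts bytes; the accented "a" takes two

print(char_len, byte_len)
```

The two lengths differ by one because only one character in this string needs more than one byte in UTF-8.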

What’s happening is that when you do str_start = html.index(u'Aproxim\xe1damente '), Python implicitly decodes the html variable using the interpreter's default encoding, which on your system appears to be UTF-8. (On my PC that line simply raises a UnicodeDecodeError, because a stock Python 2.x install uses ASCII as its default encoding. Some of our system settings relating to text encoding must be different.) Consequently, if str_start is n, then u'Aproxim\xe1damente ' appears at the nth character of the HTML. However, html itself is still a byte string, so when you later use that value as a slice index to get the content after the (n+16)th character, what you actually get is the content after the (n+16)th byte. In this case the two are not equivalent, because earlier content of the page contains accented characters that take up 2 bytes each when encoded in UTF-8.
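The mismatch between character indexes and byte indexes can be reproduced offline. The sketch below is in Python 3 syntax (bytes standing in for Python 2's str), and sample_text is a hypothetical stand-in for the real page: it places seven two-byte characters before the match, a number chosen only so that the sketch prints the same wrong output as in the question. The actual content of the page is not known here.

```python
# Hypothetical stand-in for the page; seven accented characters precede the match.
sample_text = (u'<p>\xe1\xe9\xed\xf3\xfa\xf1\xfc</p>\n'
               u'<div id="WPaging_total">\n'
               u'  Aproxim\xe1damente 37 resultados.\n</div>')
raw = sample_text.encode('utf-8')          # what conn.read() would hand back

prefix = u'Aproxim\xe1damente '
char_start = sample_text.index(prefix)            # index counted in characters
byte_start = raw.index(prefix.encode('utf-8'))    # index counted in bytes
offset = byte_start - char_start                  # one per earlier two-byte character

# Slicing the bytes with the character index lands too early.
wrong = raw[char_start + 16:raw.find(b' resultados', char_start + 16)]

# Slicing the decoded text with the character index is consistent.
right = sample_text[char_start + 16:sample_text.find(u' resultados', char_start + 16)]

print(offset, wrong, right)
```

With this sample the byte slice starts seven bytes too early, which is why adding a fixed correction such as 24 instead of 16 only works for one particular page.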

The best solution would be simply to convert the html to unicode when you receive it. This small modification to your sample code will do what you want with no errors or weird behaviour:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read().decode('utf-8')  # decode the raw bytes to unicode right away

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]
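The same idea can be tried without a network connection. This is only a sketch in Python 3 syntax: extract_count and the sample bytes are hypothetical names and data introduced for illustration, and it assumes the page really is UTF-8 encoded. It also uses len(prefix) in place of the hard-coded 16.

```python
def extract_count(raw_html, encoding='utf-8'):
    """Return the text between the prefix and ' resultados'."""
    html = raw_html.decode(encoding)            # decode once, then work in characters only
    prefix = u'Aproxim\xe1damente '
    start = html.index(prefix) + len(prefix)    # raises ValueError if the prefix is absent
    end = html.find(u' resultados', start)
    return html[start:end]

# Hypothetical page fragment, encoded the way a server might send it.
sample = (u'<p>\xe1\xe9\xed</p>\n<div id="WPaging_total">\n'
          u'  Aproxim\xe1damente 37 resultados.\n</div>').encode('utf-8')

result = extract_count(sample)
print(result)
```

Because every index is computed and used on the decoded text, accented characters earlier in the page no longer shift the result.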
