i’m making a crawler to get text html inside, i’m using beautifulsoup. when I

Question


Asked: May 26, 2026 (2026-05-26T17:35:47+00:00)

I'm making a crawler to extract the text inside HTML pages, and I'm using BeautifulSoup.

When I open the URL using urllib2, the library automatically converts HTML that uses Portuguese accented characters like "ã ó é õ" into other characters like "a³ a¡ a´a§".

What I want is just to get the words without accents:

contrã¡rio -> contrario

I tried this algorithm, but it only works when the text already contains correctly decoded words like "olá coração contrário":

    import unicodedata
    def strip_accents(s):
        # s must be Unicode text: decompose to NFD, then drop combining marks ('Mn')
        return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')


1 Answer

Editorial Team · Answer 1 · 2026-05-26T17:35:47+00:00

First, make sure that your crawler returns the HTML as Unicode text (e.g. Scrapy has a method response.body_as_unicode() that does exactly this).
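As a rough illustration of why this first step matters, here is a minimal sketch in Python 3. It assumes the page is served as UTF-8; the helper name `decode_html` and the sample bytes are hypothetical stand-ins for whatever your crawler actually reads. Decoding the same bytes with the wrong codec produces text of the kind shown in the question.

```python
def decode_html(raw_bytes, encoding="utf-8"):
    """Decode raw response bytes into Unicode text (hypothetical helper).

    The default encoding is an assumption; a real page may declare
    a different one in its headers or <meta> tag.
    """
    return raw_bytes.decode(encoding)


# Stand-in for the bytes a crawler would read from the network.
raw = "contrário".encode("utf-8")

wrong = raw.decode("latin-1")  # wrong codec: each UTF-8 byte becomes its own character
right = decode_html(raw)       # matching codec: the accented letter survives

print(repr(wrong))
print(repr(right))
```

If the garbled form appears in your output, the bytes were most likely decoded with a codec that does not match the page's real encoding.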

Once you have Unicode text, the step of going from Unicode to its closest ASCII equivalent is handled by the Unidecode package: http://pypi.python.org/pypi/Unidecode/0.04.1

    from unidecode import unidecode
    # Transliterate two Chinese characters to ASCII; print() works on Python 2 and 3
    print(unidecode(u"\u5317\u4EB0"))

The output is “Bei Jing”
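For the Portuguese words in the question, the two steps can also be sketched with the standard library alone. This is only an illustration in Python 3, assuming the input bytes are UTF-8; `to_ascii` is a hypothetical name, and `strip_accents` is the asker's function applied after decoding. Unlike Unidecode, this approach only removes combining accents and does not transliterate other scripts.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks from Unicode text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def to_ascii(raw_bytes, encoding="utf-8"):
    """Decode bytes first, then strip accents (hypothetical helper).

    The UTF-8 default is an assumption about the page being crawled.
    """
    return strip_accents(raw_bytes.decode(encoding))


# Stand-in for bytes fetched by the crawler.
sample = "olá coração contrário".encode("utf-8")
print(to_ascii(sample))  # ola coracao contrario
```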
