I’m working on a little web crawler…
I’m having a problem with accents. For example, if a web page contains the word Apuntó, when I puts it the console (cmd.exe) shows me apunt├│. I thought it was something related to the cmd encoding, but when I printed it to a file I got exactly the same apunt├│ word…
If I do puts "apuntó" the output is correct: I get apuntó.
Any idea what’s happening?
thanks!
It looks like you need to go and learn about character encodings. A good place to start would be Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). After that I’d recommend James Gray’s series of blog posts on character encoding in Ruby.
In your case what’s happening is this. When your crawler fetches the web page, the word Apuntó is being fetched as a sequence of bytes, which is the UTF-8 encoding of the word. In this encoding the letter ó is encoded as two bytes, 0xc3 and 0xb3. Your software, however, is unaware of the encoding and assumes the bytes represent characters in the default character set, which looks like codepage 437, so they appear as ├ for 0xc3 and │ for 0xb3.
The way to handle this is to ensure that every time any text enters your program from the outside you know the encoding that text is in, and interpret it appropriately. In the case of web pages this can be a little tricky, since the encoding can be specified in a few places, including in the page itself.
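To make the explanation above concrete, here is a minimal sketch that reproduces the garbled output and then reads the same bytes correctly. It assumes Ruby 1.9 or later (where strings carry an encoding) and a UTF-8 source file. The byte array stands in for whatever your crawler actually receives, and decode_page is a hypothetical helper name, not part of any library.

```ruby
# The UTF-8 bytes of "Apuntó", built by hand to stand in for fetched page data.
raw = [0x41, 0x70, 0x75, 0x6e, 0x74, 0xc3, 0xb3].pack("C*") # binary (ASCII-8BIT) string

# Wrong reading: treat each byte as a codepage 437 character.
misread = raw.dup.force_encoding("IBM437").encode("UTF-8")
# misread is now "Apunt├│"

# Right reading: tell Ruby the bytes are UTF-8.
correct = raw.dup.force_encoding("UTF-8")
# correct is now "Apuntó"

# Hypothetical helper: tag incoming bytes with a known charset,
# check that they are valid in it, and normalise to UTF-8.
def decode_page(bytes, charset = "UTF-8")
  text = bytes.dup.force_encoding(charset)
  raise ArgumentError, "bytes are not valid #{charset}" unless text.valid_encoding?
  text.encode("UTF-8")
end

page_text = decode_page(raw, "UTF-8")
```

In a real crawler the charset argument would have to come from the response headers or from the page itself rather than being hard-coded. Getting the right text onto the cmd.exe console is a separate step that depends on the console's own codepage.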
It should become clearer what you need to do in your case when you know more about character encodings.