I’m trying to parse an HTML page with Nokogiri, but I’m having trouble with text: I cannot get rid of unwanted characters. Whenever I obtain a String while parsing, I try to clean it as much as possible by collapsing runs of whitespace and non-printable characters into a single space. After many modifications, this is the method I use, still without success:
def clear_string(str)
  CGI::unescapeHTML(str).gsub(/\s+/mu," ").strip
end
For instance, suppose this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):
<tr>
<td><span class="linkred2">Tramitació:</span></td>
<td> ordinària </td>
</tr>
Some intermediate outputs shown by NetBeans 7.0, using Nokogiri and clear_string (the method defined above):
row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]
I don’t know why strip doesn’t remove the leading space. Moreover, the parsing result after applying clear_string is dumped into a YAML file using YAML::dump. Its contents for the two texts are, respectively:
"Tramitaci\xC3\xB3:"
!binary |
wqBvcmRpbsOgcmlh
The first one seems barely OK, but I don’t know how to fix the second case.
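A note on the leading character: the scan output lists "\302\240" as the first element, and those octal bytes (0xC2 0xA0) are the UTF-8 encoding of U+00A0, the no-break space. Neither \s nor strip treats that character as whitespace, which would explain why it survives clear_string. The following is a minimal sketch, assuming Ruby 1.9 or later and UTF-8 input; clean_text is a hypothetical name, not part of Nokogiri or CGI. It relies on the POSIX class [[:space:]], which matches Unicode whitespace in UTF-8 strings.

```ruby
require 'cgi'

# Hypothetical helper: unescape HTML entities, then collapse every run of
# Unicode whitespace (including U+00A0, the no-break space) into one ASCII
# space and trim both ends.
def clean_text(str)
  CGI.unescapeHTML(str).gsub(/[[:space:]]+/, " ").strip
end

# Stand-in for row.at("td[2]").text from the question.
raw = "\u00A0ordin\u00E0ria "

puts raw.bytes.first(2).inspect   # [194, 160], i.e. 0xC2 0xA0
puts raw.strip.inspect            # the leading no-break space is still there
puts clean_text(raw).inspect      # "ordinària"
```

Because the no-break space is replaced before strip runs, the result no longer starts with a non-ASCII byte sequence.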
One way to translate characters from one character set to another is to use Iconv. For example, if you just want to convert UTF-8 to ASCII, you can call Iconv.conv with a target encoding of ASCII//TRANSLIT.
The TRANSLIT switch tells Iconv to try to transliterate (approximately match) unconvertible characters. If you instead want to drop unconvertible characters entirely, use the IGNORE switch (ASCII//IGNORE).
Note that Iconv will throw an exception with TRANSLIT if it finds something it can’t convert. To handle that case, you can combine IGNORE and TRANSLIT in the same target encoding.
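Iconv was removed from Ruby's standard library in Ruby 2.0, so on newer versions it is only available as a separate gem. As a rough sketch of the same two ideas without Iconv, String#encode can drop unconvertible characters, and Unicode normalization (Ruby 2.2 or later) can approximate transliteration for accented Latin letters. The method names to_ascii_ignore and to_ascii_approx are hypothetical, and the second one is only an approximation of what TRANSLIT does, not an equivalent.

```ruby
# Roughly comparable to the IGNORE switch: characters with no ASCII
# equivalent are dropped.
def to_ascii_ignore(str)
  str.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

# Rough approximation of TRANSLIT for accented Latin text: NFKD
# normalization splits "à" into "a" plus a combining accent and maps the
# no-break space to a plain space; the leftover non-ASCII marks are dropped.
def to_ascii_approx(str)
  to_ascii_ignore(str.unicode_normalize(:nfkd))
end

puts to_ascii_ignore("Tramitaci\u00F3:")              # Tramitaci:
puts to_ascii_approx("Tramitaci\u00F3:")              # Tramitacio:
puts to_ascii_approx("\u00A0ordin\u00E0ria ").strip   # ordinaria
```

Characters that have no decomposition into ASCII (for example CJK text) are simply removed by this sketch, so it only suits input that is mostly Latin script.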