I’m trying to parse an HTML page with Nokogiri but I’m having some issues

Question

Asked: May 24, 2026

I’m trying to parse an HTML page with Nokogiri, but I’m having trouble with the text: I cannot get rid of unwanted characters. Whenever I extract a String while parsing, I try to clean it as much as possible by collapsing each run of whitespace or non-printable characters into a single space. After many modifications, this method still does not work:

def clear_string(str)
  CGI::unescapeHTML(str).gsub(/\s+/mu," ").strip
end

For instance, suppose we have this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):

<tr>
    <td><span class="linkred2">Tramitaci&oacute;:</span></td>
    <td>&nbsp;ordinària </td>
</tr>

Some intermediate outputs shown by NetBeans 7.0 using Nokogiri and clear_string (the method defined above):

row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]

I don’t know why strip doesn’t remove the leading space. Moreover, the parsing result, after applying clear_string, is dumped into a YAML file using YAML::dump. Its contents for the two texts are, respectively:

"Tramitaci\xC3\xB3:"
!binary |
  wqBvcmRpbsOgcmlh

The first one seems barely OK, but I don’t know how to fix the second case.

1 Answer

Editorial Team · Answer 1 · 2026-05-24T18:41:16+00:00

One way to translate characters from one character set to another is to use Iconv. For example, if all you need is to convert UTF-8 to ASCII, you could do something like this:

require 'iconv'

s = "ordinària"
Iconv.conv('ASCII//TRANSLIT', 'UTF8', s)
# => "ordinaria"

The TRANSLIT switch tells Iconv to try to transliterate (approximately match) unconvertible characters. If you instead want to drop unconvertible characters entirely, you can use the IGNORE switch:

Iconv.conv('ASCII//IGNORE', 'UTF8', s)
# => "ordinria"
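
Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0 (it survives as the separate iconv gem), so on a current Ruby the calls above need that gem. The following is only a minimal sketch of how similar results might be obtained with core String methods, assuming Ruby 2.2 or later (for unicode_normalize) and a UTF-8 string. The variable names dropped and folded are illustrative, and the NFD approach only covers characters that decompose into a base letter plus combining marks, so it is not a full replacement for TRANSLIT.

```ruby
s = "ordinària"

# Rough analogue of //IGNORE: anything US-ASCII cannot represent is
# replaced by the empty string, i.e. dropped.
dropped = s.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
# => "ordinria"

# Rough analogue of //TRANSLIT for accented Latin letters: decompose each
# character into base letter + combining mark (NFD), remove the marks
# (Unicode category Mn), then drop whatever is still not ASCII.
folded = s.unicode_normalize(:nfd)
          .gsub(/\p{Mn}/, "")
          .encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
# => "ordinaria"
```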

Note that Iconv will raise an exception with TRANSLIT if it finds something it can’t convert. To avoid that, you can combine IGNORE and TRANSLIT like so:

Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', s)
# => "ordinaria"
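
Transliteration does not by itself explain the leading "space" in the question. The bytes "\302\240" in the scan output are the UTF-8 encoding of U+00A0 (no-break space, which is what &nbsp; decodes to). String#strip does not treat that character as whitespace, and on Ruby 1.9 and later \s does not match it either. A possible variant of clear_string, sketched here under the assumption of Ruby 1.9+ and UTF-8 input, uses the POSIX class [[:space:]], which does match U+00A0 in a UTF-8 string:

```ruby
require "cgi"

# Hypothetical variant of the question's clear_string: collapse every run
# of whitespace, including U+00A0 (no-break space), into one ordinary
# space, then trim both ends.
def clear_string(str)
  CGI.unescapeHTML(str)
     .gsub(/[[:space:]]+/, " ")
     .strip
end

cell = "\u00A0ordinària "   # same characters as row.at("td[2]").text
clear_string(cell)          # => "ordinària"
```

Because every whitespace variant is turned into an ordinary space first, strip can then remove it from both ends.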
