I’m using Nokogiri to parse a webpage which contains special characters, however the special

Asked: May 19, 2026

I’m using Nokogiri to parse a webpage which contains special characters, but the special characters do not get parsed correctly; they show up garbled, for example as “genealÃ³gica”.

doc=Nokogiri::HTML(open("#{BASE_URL}search=#{book}#{chapters}&version=NVI")).css('.result-text-style-normal')
doc.css('.footnotes').remove
doc.css('h4').remove
doc

Any ideas how I could fix this?

1 Answer

Editorial Team · Answer 1 · 2026-05-19T11:18:38+00:00

EDIT: I did a bit more work looking at the page and at how you’re trying to process it, and I think this works better. I also changed how you process the page, because the original wasn’t as clear as I like to see for maintainability and readability.

require 'addressable/uri'
require 'nokogiri'
require 'open-uri'

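def get_chapter(base_url, params = {})
  # Build the URL from its base and its query parameters.
  uri = Addressable::URI.parse(base_url)
  uri.query_values = params

  # URI.open comes from open-uri; a bare open() no longer fetches URLs on Ruby 3.0+.
  doc = Nokogiri::XML(URI.open(uri.to_s))
  doc.encoding = 'UTF-8'

  # Find the passage container, then strip the footnotes and headings inside it.
  div = doc.at_css('.result-text-style-normal')
  div.css('.footnotes').remove
  div.css('h4').remove

  doc
end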

page = get_chapter('http://www.biblegateway.com/passage/', :search => 'Mateo1-2', :version => 'NVI')
puts page.content

Rather than building the URL the way you were, I prefer passing it in as chunks, with the base URL and the parameters split. I build the URI using the Addressable gem, which is my go-to for munging URLs. Ruby’s built-in URI is having some growing pains right now, related to encoding of parameters.
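If pulling in Addressable isn't an option, the same base-plus-parameters split can be sketched with the standard library. This is an alternative, not what the code above uses; build_url is a hypothetical helper name, and URI.encode_www_form handles the escaping of the parameters.

```ruby
require 'uri'

# Hypothetical helper: joins a base URL and a hash of query parameters.
def build_url(base_url, params = {})
  uri = URI.parse(base_url)
  # encode_www_form escapes keys and values and joins them with "&".
  uri.query = URI.encode_www_form(params) unless params.empty?
  uri.to_s
end

url = build_url('http://www.biblegateway.com/passage/', search: 'Mateo1-2', version: 'NVI')
puts url
```

The resulting string can be handed to URI.open in the same way as uri.to_s in the Addressable version.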

The document at the far end of the URL you gave says it is XHTML, so it should meet the XHTML specs. You can parse XHTML using Nokogiri::HTML(), but I think you get better results using Nokogiri::XML(), which is stricter.

To give Nokogiri an additional nudge in the right direction for parsing the content, I add:

doc.encoding = 'UTF-8'
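To see why the encoding hint matters, here is a minimal sketch of how a string like “genealÃ³gica” can come about. It assumes the page's bytes are UTF-8 and were decoded as ISO-8859-1 somewhere along the way, which matches the symptom in the question but is not confirmed for that site.

```ruby
# UTF-8 text as the server presumably sends it.
original = 'genealógica'

# Treat the same bytes as ISO-8859-1, then transcode to UTF-8 for display.
# The two bytes of "ó" (0xC3 0xB3) become the two characters "Ã" and "³".
garbled = original.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts garbled

# Reversing the two steps recovers the original text.
repaired = garbled.encode('ISO-8859-1').force_encoding('UTF-8')
puts repaired
```

Repairing after the fact like this is fragile; telling the parser the right encoding up front avoids the problem in the first place.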

I prefer finding the desired div and assigning it to a temporary variable, and working from that point, rather than doing it chained to the parse step like you did. It’s a bit more idiomatic and readable this way because we’re dealing with chunks of the document.

Running the code outputs what appears to be nice and clean content. There is some embedded JavaScript in the output, because JavaScript is treated as text inside the <script> tags. That isn’t an issue if you are presenting the HTML for a browser to render.
