I have a document that looks similar to this (note the title):

Question

Asked: May 25, 2026

I have a document that looks similar to this (note the title):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
  <head>
    <title>Sã�ng Title</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
    <div id="container">
      Some Text
    </div>
  </body>
</html>

When I fetch and parse this document with Nokogiri using this code:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open(url).read)

The result from Nokogiri is this:

ruby-1.9.2-p290 :060 > pp doc
#(Document:0x82e5ed2c {
  name = "document",
  children = [
    #(DTD:0x82e5e994 { name = "HTML" }),
    #(Element:0x82e5e0c0 {
      name = "html",
      attributes = [
        #(Attr:0x82e5e05c {
          name = "xmlns",
          value = "http://www.w3.org/1999/xhtml"
          }),
        #(Attr:0x82e5e048 {
          name = "xmlns:fb",
          value = "http://www.facebook.com/2008/fbml"
          })],
      children = [
        #(Element:0x82e5d8dc {
          name = "head",
          children = [
            #(Element:0x82e5d6d4 {
              name = "title",
              children = [ #(Text "Sã")]
              })]
          })]
      })]
  })

To me, it looks like the character AFTER “Sã” causes Nokogiri to choke and think the document has ended. As you can see, the rest of the document, including the #container div, is not included at all.

Anyone know how to deal with this situation?

This is killing me… Thank you!!

Edit:
Upon further research I’ve found that the actual character causing the choke is a Unicode null character, “\u0000”.

Right now I’m thinking I can do something like this:

page_content = open(url).read
# Remove null character
page_content.gsub!(/\u0000/, '')
Nokogiri::HTML(page_content)


1 Answer

Editorial Team · Answer 1 · 2026-05-25T19:35:14+00:00

Are you sure that the character after Sã is a valid UTF-8 character?

Added: There are illegal UTF-8 byte sequences. To decode UTF-8 manually, try this decoder. You can enter the incoming hex and it will tell you what each individual byte means.
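If you would rather check this from Ruby than by hand, here is a minimal sketch. It assumes Ruby 2.1 or newer (String#scrub does not exist on 1.9.2), and both the helper name utf8_report and the sample bytes are made up for illustration: "S", "ã", a stray 0xE3 byte that is not valid UTF-8 on its own, then "ng".

```ruby
# Hypothetical helper: report whether a byte sequence is valid UTF-8 and
# show what it looks like once invalid bytes are replaced with U+FFFD.
def utf8_report(bytes)
  text = bytes.pack('C*').force_encoding(Encoding::UTF_8)
  {
    valid: text.valid_encoding?,
    scrubbed: text.scrub("\uFFFD") # String#scrub requires Ruby 2.1+
  }
end

# Made-up sample: 0xE3 starts a 3-byte sequence but is followed by "n".
report = utf8_report([0x53, 0xC3, 0xA3, 0xE3, 0x6E, 0x67])
puts report[:valid]
puts report[:scrubbed].codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
```

If valid comes back false, the data was already broken before Nokogiri ever saw it.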

A good overview of UTF-8. UTF-8 code chart

Re: removing the null character. Your code looks OK, so try it out! In addition, I’d investigate where the null in your incoming datastream comes from.
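To help with that investigation, here is a rough sketch that reports the byte offsets of any nulls before stripping them. The helper names null_offsets and strip_nulls and the sample string are hypothetical, and the sketch assumes the data is otherwise valid UTF-8.

```ruby
# Hypothetical helper: byte offsets of every null (0x00) byte in the data.
def null_offsets(data)
  offsets = []
  data.each_byte.with_index { |byte, i| offsets << i if byte.zero? }
  offsets
end

# Hypothetical helper: return a copy of the data with null characters removed.
def strip_nulls(data)
  data.delete("\u0000")
end

# Made-up sample standing in for the fetched page content.
page_content = "S\u00E3\u0000ng Title"
puts null_offsets(page_content).inspect
puts strip_nulls(page_content)
```

Knowing the offsets makes it easier to compare against the raw response and see whether the null comes from the server or from something in between.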

Also, the bytes in your original post are, in fact, the UTF-8 encoding of the "unknown character" symbol, not your original datastream. Here is what is in your post:

53 C3 A3 EF BF BD 6E 67

Here is the decoding:

U+0053 LATIN CAPITAL LETTER S character
U+00E3 LATIN SMALL LETTER A WITH TILDE character (&#x00E3;)
U+FFFD REPLACEMENT CHARACTER character (&#xFFFD;)  # this is the char used when
                                                   # the orig is not understood.
U+006E LATIN SMALL LETTER N character
U+0067 LATIN SMALL LETTER G character
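The same decoding can be reproduced in Ruby. This is only a sketch, and decode_hex_utf8 is a name made up for the example: it turns the space-separated hex into bytes, treats them as UTF-8, and lists the code points.

```ruby
# Hypothetical helper: decode space-separated hex bytes as UTF-8 and
# return the code points in U+XXXX notation.
def decode_hex_utf8(hex)
  bytes = [hex.delete(' ')].pack('H*') # hex digits -> binary string
  bytes.force_encoding(Encoding::UTF_8).codepoints.map { |cp| format('U+%04X', cp) }
end

puts decode_hex_utf8('53 C3 A3 EF BF BD 6E 67').inspect
# prints ["U+0053", "U+00E3", "U+FFFD", "U+006E", "U+0067"]
```

The EF BF BD triple is the giveaway: it is U+FFFD itself, so whatever byte was originally there had already been replaced by the time the text was posted.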
