The Archive Base

Editorial Team
Asked: May 25, 2026

I have a document that looks similar to this (note the title):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
  <head>
    <title>Sã�ng Title</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
    <div id="container">
      Some Text
    </div>
  </body>
</html>

When I fetch this document and parse it with Nokogiri using this code:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open(url).read)

The result from Nokogiri is this:

ruby-1.9.2-p290 :060 > pp doc
#(Document:0x82e5ed2c {
  name = "document",
  children = [
    #(DTD:0x82e5e994 { name = "HTML" }),
    #(Element:0x82e5e0c0 {
      name = "html",
      attributes = [
        #(Attr:0x82e5e05c {
          name = "xmlns",
          value = "http://www.w3.org/1999/xhtml"
          }),
        #(Attr:0x82e5e048 {
          name = "xmlns:fb",
          value = "http://www.facebook.com/2008/fbml"
          })],
      children = [
        #(Element:0x82e5d8dc {
          name = "head",
          children = [
            #(Element:0x82e5d6d4 {
              name = "title",
              children = [ #(Text "Sã")]
              })]
          })]
      })]
  })

To me, it looks like the character AFTER “Sã” causes Nokogiri to choke and think the document has ended. As you can see, the rest of the title, the body, and the #container div are not included at all.

Anyone know how to deal with this situation?

This is killing me… Thank you!!

Edit:
Upon further research, I’ve found that the actual character causing the choke is a Unicode null character, “\u0000”.

Right now I’m thinking I can do something like this:

page_content = open(url).read
# Remove null character
page_content.gsub!(/\u0000/, '')
Nokogiri::HTML(page_content)
1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 7:35 pm

    Are you sure that the character after Sã is a valid UTF-8 character?

    Added: Some byte sequences are illegal in UTF-8. To decode UTF-8 manually, try this decoder: you can enter the incoming hex and it will tell you what each individual byte means.
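If you would rather check the bytes from Ruby itself, here is a minimal sketch. The helper name `inspect_utf8` and the sample byte strings are made up for illustration, and `String#scrub` needs Ruby 2.1 or newer, so it will not run on the 1.9.2 shown in the question.

```ruby
# Hypothetical helper: report whether raw bytes are well-formed UTF-8 and
# return a copy in which each illegal sequence is replaced by U+FFFD.
def inspect_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  {
    valid: text.valid_encoding?,
    scrubbed: text.scrub("\uFFFD") # String#scrub requires Ruby 2.1+
  }
end

good = "S\xC3\xA3ng".b       # "Sãng" as raw bytes
bad  = "S\xC3\xA3\xFFng".b   # the byte 0xFF never appears in valid UTF-8

p inspect_utf8(good)[:valid]
p inspect_utf8(bad)[:valid]
p inspect_utf8(bad)[:scrubbed]
```

Note that a null character is itself valid UTF-8, so this check alone would not flag the `\u0000` from the question. It only rules out malformed sequences as a second problem.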

    See also: a good overview of UTF-8 and a UTF-8 code chart.

    Re: Removing the null character. Your code looks ok, try it out! But in addition, I’d investigate the source of the null in your incoming datastream.
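As a starting point for that investigation, here is a rough sketch that strips the nulls and also records where they were. The helper name `strip_nulls` and the sample markup are assumptions for illustration. It works on a binary copy of the string, on the assumption that the input might also contain malformed UTF-8, which can make a regexp-based `gsub!` raise.

```ruby
# Hypothetical helper: remove NUL bytes from fetched HTML and return the
# byte offsets where they occurred, to help trace where they come from.
def strip_nulls(raw)
  bytes = raw.b # binary copy, so malformed UTF-8 cannot raise here
  offsets = []
  bytes.each_byte.with_index { |byte, index| offsets << index if byte.zero? }
  cleaned = bytes.delete("\x00").force_encoding(Encoding::UTF_8)
  [cleaned, offsets]
end

# Made-up sample with a null character in the middle of the title.
html = "<title>S\u00E3\u0000ng Title</title><body>Some Text</body>"
cleaned, offsets = strip_nulls(html)
p offsets
p cleaned
```

The cleaned string can then be handed to `Nokogiri::HTML` as in the question, and the offsets tell you which part of the upstream response to look at.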

    Also, the bytes in your original post are, in fact, the UTF-8 encoding of the unknown-character symbol (the replacement character), not your original datastream. Here is what is in your post:

    53 C3 A3 EF BF BD 6E 67
    

    Here is the decoding:

    U+0053 LATIN CAPITAL LETTER S character
    U+00E3 LATIN SMALL LETTER A WITH TILDE character (&#x00E3;)
    U+FFFD REPLACEMENT CHARACTER character (&#xFFFD;)  # this is the char used when
                                                       # the orig is not understood.
    U+006E LATIN SMALL LETTER N character
    U+0067 LATIN SMALL LETTER G character
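The same decoding can be reproduced in Ruby. This is only a sketch, and `decode_hex` is a hypothetical helper name: it packs the hex bytes into a string, tags it as UTF-8, and lists the code point of each character.

```ruby
# Hypothetical helper: turn a space-separated hex dump into a UTF-8 string
# and list the Unicode code point of each decoded character.
def decode_hex(hex)
  bytes = hex.split.map { |pair| pair.to_i(16) }
  text = bytes.pack("C*").force_encoding(Encoding::UTF_8)
  code_points = text.each_char.map { |ch| format("U+%04X", ch.ord) }
  [text, code_points]
end

text, code_points = decode_hex("53 C3 A3 EF BF BD 6E 67")
puts text.valid_encoding?   # true: the post itself is well-formed UTF-8
puts code_points.join(" ")  # U+0053 U+00E3 U+FFFD U+006E U+0067
```

The three bytes EF BF BD decode to U+FFFD, which confirms that the post carries the replacement character rather than the byte that originally caused the problem.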
    