I have the following attribute in an xml node I’m reading with libxml. It prints out normally with the accented character if I print out reader.node.
reader = XML::Reader.new(File.open("somefile.xml", "r"))
reader.read
reader.read
...
p reader.node
=> ... Full_Name="Univisión Network - East Feed" ...
If I do this, though, it comes out escaped.
p reader.node["Full_Name"]
=> "Univisi\xC3\xB3n Network - East Feed"
And when I try to convert this value to json later, I get the following error.
Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
Here is the xml line in the document
<?xml version="1.0" encoding="ISO-8859-1"?>
I don’t have control over the xml document itself. How can I get that unicode character back into json, or into a format json understands?
EDIT: Oh, I forgot to mention – this is how it looks in the actual XML document
Full_Name="Univisión Network - East Feed"
EDIT
so i’ve been trying to figure this out for quite some time now. funny thing: your code works without error in ruby 1.8 (at least here), so i think the error has to do with ruby 1.9’s new encoding handling. somehow it cannot figure out that the parsed and read XML is in (libxml’s internal) utf-8 format. the document encoding doesn’t matter here: in 1.8 it works with both iso-8859-1 and utf-8, even with the wrong xml encoding declaration. instead, ruby 1.9 treats the string as ASCII-8BIT, or BINARY; in other words, it doesn’t know the encoding, which is why to_json fails trying to convert it to utf-8.
your easiest way to solve it might be to downgrade to ruby 1.8.
alternatively, your approach of force_encoding('UTF-8') seems to be reasonable.
EDIT END
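to make the force_encoding idea concrete, here is a minimal sketch. it assumes the bytes coming back from the reader really are valid utf-8 and are only tagged as ASCII-8BIT. the raw string below just simulates the value of reader.node["Full_Name"] so the snippet runs without libxml, and retag_as_utf8 is a made-up helper name, not part of any library.

```ruby
require "json"

# Stand-in for reader.node["Full_Name"]: valid UTF-8 bytes, but tagged
# as ASCII-8BIT (binary). This simulation is an assumption for the demo.
raw = "Univisi\xC3\xB3n Network - East Feed".b

# Hypothetical helper: relabel the bytes as UTF-8 without changing them.
# force_encoding only changes the encoding tag; it does not transcode.
def retag_as_utf8(value)
  str = value.dup.force_encoding(Encoding::UTF_8)
  # Guard against bytes that are not actually UTF-8.
  raise ArgumentError, "bytes are not valid UTF-8" unless str.valid_encoding?
  str
end

full_name = retag_as_utf8(raw)
json = JSON.generate("Full_Name" => full_name)
puts json
```

note that this only relabels the bytes. if the bytes were really iso-8859-1, you would need String#encode to transcode them instead, and the valid_encoding? check is there to catch that case early.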
you can try passing the proper encoding to the reader: