I am parsing an XML file using Nokogiri and Ruby 1.9.2. Everything seems to be working fine until I read the Descriptions (below). The text is being truncated. The input text is:
<Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.
There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.
The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
But instead I am getting:
g. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.
Notice that the output starts at "g.", so more than half of the text is missing.
Here is the complete XML file:
<?xml version="1.0" encoding="utf-8"?>
<Hotel>
<HotelID>1040900</HotelID>
<HotelFileName>Copthorne_Hotel_Aberdeen</HotelFileName>
<HotelName>Copthorne Hotel Aberdeen</HotelName>
<CityID>10</CityID>
<CityFileName>Aberdeen</CityFileName>
<CityName>Aberdeen</CityName>
<CountryCode>GB</CountryCode>
<CountryFileName>United_Kingdom</CountryFileName>
<CountryName>United Kingdom</CountryName>
<StarRating>4</StarRating>
<Latitude>57.146068572998</Latitude>
<Longitude>-2.111680030823</Longitude>
<Popularity>1</Popularity>
<Address>122 Huntly Street</Address>
<CurrencyCode>GBP</CurrencyCode>
<LowRate>36.8354</LowRate>
<Facilities>1|2|3|5|6|8|10|11|15|17|18|19|20|22|27|29|30|34|36|39|40|41|43|45|47|49|51|53|55|56|60|62|140|154|209</Facilities>
<NumberOfReviews>239</NumberOfReviews>
<OverallRating>3.95</OverallRating>
<CleanlinessRating>3.98</CleanlinessRating>
<ServiceRating>3.98</ServiceRating>
<FacilitiesRating>3.83</FacilitiesRating>
<LocationRating>4.06</LocationRating>
<DiningRating>3.93</DiningRating>
<RoomsRating>3.68</RoomsRating>
<PropertyType>0</PropertyType>
<ChainID>92</ChainID>
<Checkin>14</Checkin>
<Checkout>12</Checkout>
<Images>
<Image>19305754</Image>
<Image>19305755</Image>
<Image>19305756</Image>
<Image>19305757</Image>
<Image>19305758</Image>
<Image>19305759</Image>
<Image>19305760</Image>
<Image>19305761</Image>
<Image>19305762</Image>
<Image>19305763</Image>
<Image>19305764</Image>
<Image>19305765</Image>
<Image>19305766</Image>
<Image>19305767</Image>
<Image>37102984</Image>
</Images>
<Descriptions>
<Description>
<Name>General Description</Name>
<Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.
There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.
The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
</Description>
<Description>
<Name>LocationDescription</Name>
<Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value>
</Description>
</Descriptions>
</Hotel>
And here is my Ruby program:
require 'rubygems'
require 'nokogiri'
require 'ap'
include Nokogiri
class Hotel < Nokogiri::XML::SAX::Document
  def initialize
    @h = {}
    @h["Images"] = Array.new([])
    @h["Descriptions"] = Array.new([])
    @desc = {}
  end
  def end_document
    ap @h
    puts "Finished..."
  end
  def start_element(element, attributes = [])
    @element = element
    @desc = {} if element == "Description"
  end
  def end_element(element, attributes = [])
    @h["Images"] << @characters if element == "Image"
    @desc["Name"] = @characters if element == "Name"
    if element == "Value"
      @desc["Value"] = @characters
      @h["Descriptions"] << @desc
    end
    @h[element] = @characters unless %w(Images Image Descriptions Description Hotel Name Value).include? element
  end
  def characters(string)
    @characters = string
  end
end
# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(Hotel.new)
# Feed the parser some XML
parser.parse(File.open("/Users/cbmeeks/Projects/shared/data/text/HotelDatabase_EN/00/1040900.xml", 'rb'))
Thanks
I stripped down the XML because it had a lot of unnecessary nodes for the problem. Here’s a sample of how I go after text:
With a sample of the output:
This purposely only goes after the Value nodes. It’d be simple to modify the sample to grab the image nodes too.
Now, a couple of questions: Why use SAX mode? Is the incoming XML bigger than can reasonably fit into the RAM of your host? If not, use DOM, as it’s much easier to use.
When I ran it the first time, Ruby told me "invalid multibyte char (US-ASCII)", meaning there’s something in the XML it didn’t like. I fixed that by adding the # encoding line. I’m using Ruby 1.9.2, which makes it easier to deal with such things.
I’m using CSS accessors for the search. Nokogiri allows XPath and CSS, so you’re free to indulge your XML-parsing heart’s desire however you want.