I have a very simply formatted XML document that I would like to translate

Asked: May 19, 2026

I have a very simply formatted XML document that I would like to translate into TSV suitable for an import into Hive. The formatting of this document is straightforward:

<root>
   <row>
      <ID>0</ID>
      <ParentID>0</ParentID>
      <Url></Url>
      <Title></Title>
      <Text></Text>
      <Username></Username>
      <Points>0</Points>
      <Type>0</Type>
      <Timestamp></Timestamp>
      <CommentCount>0</CommentCount>
   </row>
</root>

I have a working Ruby script that will translate a document formatted as above into TSVs properly. That’s here:

require "rubygems"
require "crack"

xml = Crack::XML.parse(File.read("sample.xml"))

xml['root']['row'].each{ |i|
  puts "#{i['ID']}\t#{i['ParentID']}\t#{i['Url']}\t#{i['Title']}..."
}

Unfortunately, the files I need to translate are substantially larger than this script can handle (> 1 GB).

Which is where Hadoop comes in. The simplest solution is probably to write a MapReduce job in Java, but that’s not an option given that I lack Java skills. So I wanted to write a mapper script in either Python or Ruby, neither of which I’m an expert in, but both of which I can at least navigate.

My plan then was to do the following:

1. Use StreamXmlRecordReader to parse the file record by record.
2. Map: deserialize each record using crack.
3. Reduce: simply regurgitate the elements, separated by tabs.

This approach has failed consistently, however. I’ve used a variety of Ruby/Wukong scripts with no success. Here’s one based off the article here:

#!/usr/bin/env ruby

require 'rubygems'
require 'crack'

xml = nil
STDIN.each_line do |line|
  puts |line|
  line.strip!

  if line.include?("<row")
    xml = Crack::XML.parse(line)
    xml['root']['row'].each{ |i|
      puts "#{i['ID']}\t#{i['ParentID']}\t#{i['Url']}..."
  else
    puts 'no line'
  end

  if line.include?("</root>")
    puts 'EOF'
  end
end

This and other jobs fail as follows:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2+737.jar -input /hackernews/Datasets/sample.xml -output out -mapper mapper.rb -inputreader "StreamXmlRecordReader,begin=<row,end=</row>"
packageJobJar: [/var/lib/hadoop-0.20/cache/sog/hadoop-unjar1519776523448982201/] [] /tmp/streamjob2858887307771024146.jar tmpDir=null
11/01/14 17:29:17 INFO mapred.FileInputFormat: Total input paths to process : 1
11/01/14 17:29:17 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-0.20/cache/sog/mapred/local]
11/01/14 17:29:17 INFO streaming.StreamJob: Running job: job_201101141647_0001
11/01/14 17:29:17 INFO streaming.StreamJob: To kill this job, run:
11/01/14 17:29:17 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201101141647_0001
11/01/14 17:29:17 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201101141647_0001
11/01/14 17:29:18 INFO streaming.StreamJob:  map 0%  reduce 0%
11/01/14 17:30:05 INFO streaming.StreamJob:  map 100%  reduce 100%
11/01/14 17:30:05 INFO streaming.StreamJob: To kill this job, run:
11/01/14 17:30:05 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201101141647_0001
11/01/14 17:30:05 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201101141647_0001
11/01/14 17:30:05 ERROR streaming.StreamJob: Job not Successful!
11/01/14 17:30:05 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

The first problem is that I can’t tell where I’m failing: my script, or StreamXmlRecordReader.
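One way to separate the two would be to run the mapper outside Hadoop, feeding it the sample file on stdin and looking at its exit status and stderr. A Ruby syntax or runtime error would then show up directly instead of being buried in the task logs. The following is only a minimal sketch: `run_mapper_locally` is a hypothetical helper name, and the `mapper.rb` and `sample.xml` paths in the commented example are assumptions taken from the command line in this question.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: run a mapper command locally with the given text on stdin
# and collect its exit status, stdout, and stderr for inspection.
def run_mapper_locally(command, input)
  stdout, stderr, status = Open3.capture3(*command, stdin_data: input)
  { ok: status.success?, stdout: stdout, stderr: stderr }
end

# Example use (file names are assumptions, not verified paths):
# result = run_mapper_locally([RbConfig.ruby, 'mapper.rb'], File.read('sample.xml'))
# warn result[:stderr] unless result[:ok]
```

If the script exits cleanly and prints the expected lines here, the remaining suspects are the record reader settings and the way the job is submitted.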

The second problem is that a gracious and helpful expert tells me that, because StreamXmlRecordReader doesn’t produce an additional record delimiter, this approach probably isn’t going to work. Instead, I’ll need to read in single lines, grep for <row, stack up everything until I reach </row>, and then parse what I’ve collected.

Is this the simplest approach, and if so, how might I best accomplish it?

In case it helps: performance isn’t a huge issue, because these files are batch processed only every few weeks or so.
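For reference, the line-stacking approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested Hadoop job: it assumes each tag sits on its own line as in the sample, that fields are flat `<Name>value</Name>` pairs with no attributes or nesting, and it swaps crack for a small regular expression so the sketch has no gem dependencies. The names `FIELDS`, `ENTITIES`, `row_to_tsv`, and `map_rows` are hypothetical.

```ruby
#!/usr/bin/env ruby
# Sketch: buffer lines from <row> to </row>, then emit one TSV line per row.
# Assumption: flat <Name>value</Name> fields, one tag per line, as in the sample.

FIELDS = %w[ID ParentID Url Title Text Username Points Type Timestamp CommentCount].freeze
ENTITIES = { 'lt' => '<', 'gt' => '>', 'amp' => '&', 'quot' => '"', 'apos' => "'" }.freeze

# Convert one buffered <row>...</row> block into a tab-separated string.
def row_to_tsv(block)
  values = FIELDS.map do |name|
    raw = block[%r{<#{name}>(.*?)</#{name}>}m, 1].to_s   # missing field becomes ""
    text = raw.gsub(/&(lt|gt|amp|quot|apos);/) { ENTITIES[Regexp.last_match(1)] }
    text.gsub(/[\t\r\n]+/, ' ').strip                     # keep the record on one line
  end
  values.join("\t")
end

# Read lines from input, accumulate each row, and write TSV lines to output.
def map_rows(input, output)
  buffer = nil
  input.each_line do |line|
    buffer = String.new if buffer.nil? && line.include?('<row>')
    next if buffer.nil?              # ignore lines outside a row, such as <root>
    buffer << line
    if line.include?('</row>')
      output.puts row_to_tsv(buffer)
      buffer = nil
    end
  end
end

# As a mapper script, the file would end with: map_rows($stdin, $stdout)
```

Because the function only reads lines and keeps one row in memory at a time, the same code can be tried on a local file before involving Hadoop at all. Replacing the regular expression in `row_to_tsv` with a call to a real XML parser on the buffered block would be more robust if the data ever contains CDATA sections or attributes.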