I am using Hadoop to process an XML file, so I had written a mapper file

Question

Editorial Team

Asked: June 14, 2026 (2026-06-14T04:32:31+00:00)

I am using Hadoop to process an XML file, so I wrote a mapper file and a reducer file in Python.

Suppose the input that needs to be processed is test.xml:

<report>
 <report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
 <date-range date="All Time"/>
 <table>
  <columns>
   <column name="campaignID" display="Campaign ID"/>
   <column name="adGroupID" display="Ad group ID"/>
  </columns>
  <row campaignID="79057390" adGroupID="3451305670"/>
  <row campaignID="79057390" adGroupID="3451305670"/>
 </table>
</report>

mapper.py file

import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row") != -1:
        .............
        .............
        .............
        print '%s\t%s'%(campaignID,adGroupID )

reducer.py file

import sys
if __name__ == '__main__':
    for line in sys.stdin:
        print line.strip()

I ran Hadoop with the following command:

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml \
    -output /path/to/output_folder/to/store/file

When I run the above command, Hadoop correctly creates an output file at the output path, in the format produced by reducer.py and with the required data.

What I am trying to do now is different: I don't want to store the output in the text file that Hadoop creates by default when I run the above command. Instead, I want to save the data into a MySQL database.

So I wrote some Python code in reducer.py that writes the data directly to the MySQL database, and tried to run the above command without the output path, as below:

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml

And I get an error like the one below:

12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
Usage: $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <cmd|JavaClassName> The streaming command to run
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
   .........................
   .........................

So my questions are:
How can I save the data in a database after processing the files?
In which file (mapper.py or reducer.py) should the code that writes the data into the database go?
Which command is used to run Hadoop so that the data is saved into a database? When I removed the output folder path from the hadoop command, it showed an error.

Can anyone please help me solve the above problem?

Edited

Process followed

I created the mapper and reducer files as above; they read the XML file, and the hadoop command writes a text file to a folder.

For example, the text file (the result of processing the XML file with the hadoop command) is at the path below:

/home/local/user/Hadoop/xml_processing/xml_output/part-00000

Here the XML file is 1.3 GB, and after processing with Hadoop the resulting text file is 345 MB.

What I want to do now is read the text file at the above path and save the data to the MySQL database as fast as possible.

I have tried this with basic Python, but it takes about 350 seconds to process the text file and save it to the MySQL database.

Now, as suggested by nichole, I downloaded Sqoop and unzipped it at a path like the one below:

/home/local/user/sqoop-1.4.2.bin__hadoop-0.20

I went into the bin folder and ran ./sqoop, and I received the output below:

sh-4.2$ ./sqoop
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.

Try 'sqoop help' for usage.

I have also tried the command below:

./sqoop export --connect jdbc:mysql://localhost/Xml_Data --username root --table PerformaceReport --export-dir /home/local/user/Hadoop/xml_processing/xml_output/part-00000 --input-fields-terminated-by '\t'

Result

Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.

12/11/27 11:54:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/11/27 11:54:57 INFO tool.CodeGenTool: Beginning code generation
12/11/27 11:54:57 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:636)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:525)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:548)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)
    at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)
    at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:262)
    at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
    at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)

Is the above sqoop command suitable for reading the text file and saving it into the database? We have to read from the text file and insert into the database.

1 Answer

Editorial Team · Answer 1 · 2026-06-14T04:32:32+00:00

I code all my Hadoop MR jobs in Python. Let me just say that you need not use Python for moving data. Use Sqoop: http://sqoop.apache.org/

Sqoop is an open-source tool for transferring data between relational databases and Hadoop: it can import tables into HDFS for further processing and export HDFS files back into database tables. It is very simple to use. All you need to do is:

Download and configure Sqoop
Create your MySQL table schema
Specify the HDFS file name, the result table name and the column separator.
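
The last step is a single `sqoop export` invocation. As a rough sketch, assuming you would rather assemble that command from a Python script than type it by hand, the helper below only builds the argument list. The function name `build_sqoop_export_cmd` and its parameters are hypothetical; the flags are the same ones used in the sqoop command later in this answer.

```python
def build_sqoop_export_cmd(jdbc_url, username, table, export_dir, separator,
                           sqoop_bin="./sqoop"):
    """Assemble the argument list for a `sqoop export` run.

    The function and parameter names are illustrative only; the
    command-line flags are the ones from the sqoop command in this answer.
    """
    return [
        sqoop_bin, "export",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
        "--input-fields-terminated-by", separator,
    ]


if __name__ == "__main__":
    cmd = build_sqoop_export_cmd(
        jdbc_url="jdbc:mysql://localhost/test",
        username="root",
        table="testtable",
        export_dir="/user/output",
        separator=":",
    )
    # Print the command so it can be inspected before running it.
    print(" ".join(cmd))
```

The returned list can be handed to `subprocess.check_call` once Sqoop itself is installed and configured; keeping it as a list avoids shell quoting problems with separators such as a tab.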

Read this for more info : http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html

The advantage of using Sqoop is that we can move our HDFS data to many kinds of relational stores (MySQL, Derby, Hive, etc.), and vice versa, with a single-line command.

For your use case, please make the necessary changes:

mapper.py

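#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    # Only the <row .../> lines carry data.
    if line.find("<row") != -1:
        # e.g. <row campaignID="79057390" adGroupID="3451305670"/>
        words = line.split(' ')
        # Take the quoted value of each attribute.
        campaignID = words[1].split('"')[1]
        adGroupID = words[2].split('"')[1]
        # ':' is the column separator passed to sqoop below.
        print("%s:%s:" % (campaignID, adGroupID))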
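
Splitting on spaces works for the sample rows, but it relies on the attributes always appearing in the same order. A minimal alternative sketch, assuming every `<row .../>` element sits on a single line as in test.xml, parses each line with `xml.etree.ElementTree` and reads the attributes by name. The helper name `parse_row` is made up for illustration.

```python
#!/usr/bin/env python
import sys
import xml.etree.ElementTree as ET


def parse_row(line):
    """Return (campaignID, adGroupID) for a self-contained <row .../> line,
    or None if the line is not such a row."""
    line = line.strip()
    if not line.startswith("<row"):
        return None
    try:
        row = ET.fromstring(line)
    except ET.ParseError:
        # The element is not complete on this line (assumption violated).
        return None
    # Attributes are looked up by name, so their order does not matter.
    return row.get("campaignID"), row.get("adGroupID")


if __name__ == "__main__":
    for line in sys.stdin:
        fields = parse_row(line)
        if fields is not None:
            # Same ':'-separated output as the mapper above.
            print("%s:%s:" % fields)
```

This only helps while rows stay on one line; a row element spread over several lines would need a different input format or a buffering mapper.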

streaming command

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /user/input -output /user/output

mysql

create database test;
use test;
create table testtable ( a varchar (100), b varchar(100) );

sqoop

./sqoop export --connect jdbc:mysql://localhost/test --username root --table testtable --export-dir /user/output --input-fields-terminated-by ':'

Note:

Please change the mapper as per your needs.
I have used ':' as my column separator in both the mapper and the sqoop command. Change it as needed.
Sqoop tutorials: I have personally followed Hadoop: The Definitive Guide (O'Reilly) as well as http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html.
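
On the separator note: whichever character you pick must not occur inside the values themselves, or the export will see extra columns. One hedged way to guard against that in the mapper is sketched below; `format_record` is a hypothetical helper, not part of Hadoop or Sqoop, and unlike the mapper above it does not emit a trailing separator.

```python
def format_record(fields, separator=":"):
    """Join fields into one output line, rejecting any value that
    contains the separator (it would be split into extra columns)."""
    for value in fields:
        if separator in value:
            raise ValueError(
                "field %r contains separator %r" % (value, separator))
    return separator.join(fields)


if __name__ == "__main__":
    # Values taken from the sample rows in test.xml.
    print(format_record(["79057390", "3451305670"]))
```

Raising an error makes a bad record fail loudly; depending on the data, skipping or escaping the offending value may be more appropriate.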
