The Archive Base


Editorial Team
Asked: June 14, 2026

I am using Hadoop to process an XML file, so I have written a mapper file and a reducer file in Python.

Suppose the input to process is test.xml:

<report>
 <report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
 <date-range date="All Time"/>
 <table>
  <columns>
   <column name="campaignID" display="Campaign ID"/>
   <column name="adGroupID" display="Ad group ID"/>
  </columns>
  <row campaignID="79057390" adGroupID="3451305670"/>
  <row campaignID="79057390" adGroupID="3451305670"/>
 </table>
</report>

mapper.py file

import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row") != -1:
        .............
        .............
        .............
        print '%s\t%s'%(campaignID,adGroupID )

reducer.py file

import sys
if __name__ == '__main__':
    for line in sys.stdin:
        print line.strip()

I ran Hadoop with the following command:

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar 
-file /path/to/mapper.py file -mapper /path/to/mapper.py file 
-file /path/to/reducer.py file -reducer /path/to/reducer.py file 
-input /path/to/input_file/test.xml 
-output /path/to/output_folder/to/store/file

When I run the above command, Hadoop creates an output file at the output path, correctly formatted as specified in reducer.py and containing the required data.

What I am trying to do now is this: I don't want to store the output in the text file that Hadoop creates by default when I run the above command. Instead, I want to save the data into a MySQL database.

So I wrote some Python code in reducer.py that writes the data directly to a MySQL database, and tried to run the above command without the output path, as below:

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar 
-file /path/to/mapper.py file -mapper /path/to/mapper.py file 
-file /path/to/reducer.py file -reducer /path/to/reducer.py file 
-input /path/to/input_file/test.xml 

and I get an error like the one below:

12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
Usage: $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <cmd|JavaClassName> The streaming command to run
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
   .........................
   .........................
  1. How can I save the data to a database after processing the files?
  2. In which file (mapper.py or reducer.py) should the code that writes the data to the database go?
  3. Which command runs Hadoop so that the data is saved to a database? When I removed the output folder path from the Hadoop command, it showed an error.

Can anyone please help me solve this problem?

Edit

Process followed:

  1. I created the mapper and reducer files as above. They read the XML file, and the Hadoop command writes a text file to an output folder.

    For example, the text file (the result of processing the XML file with the Hadoop command) is at:

    /home/local/user/Hadoop/xml_processing/xml_output/part-00000

Here the XML file is 1.3 GB, and the text file created after processing with Hadoop is 345 MB.

Now what I want to do is read the text file at the above path and save the data to the MySQL database as fast as possible.

I have tried this with basic Python, but it takes about 350 seconds to process the text file and save it to the MySQL database.

  2. As suggested by nichole, I downloaded Sqoop and unzipped it at a path like the one below:

    /home/local/user/sqoop-1.4.2.bin__hadoop-0.20

Then I went into the bin folder and typed ./sqoop, and received the output below:

sh-4.2$ ./sqoop
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.

Try 'sqoop help' for usage.

I also tried the following:

./sqoop export --connect jdbc:mysql://localhost/Xml_Data --username root --table PerformaceReport --export-dir /home/local/user/Hadoop/xml_processing/xml_output/part-00000 --input-fields-terminated-by '\t'

Result:

Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.

12/11/27 11:54:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/11/27 11:54:57 INFO tool.CodeGenTool: Beginning code generation
12/11/27 11:54:57 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:636)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:525)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:548)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)
    at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)
    at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:262)
    at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
    at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)

Is the above sqoop command suitable for reading the text file and saving it into the database? We have to read from the text file and insert into the database.

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 4:32 am

    I code all my Hadoop MR jobs in Python. Let me just say that you need not use Python for moving data. Use Sqoop: http://sqoop.apache.org/

    Sqoop is an open-source tool for transferring data between relational databases and Hadoop, in both directions: it can import tables into Hadoop for further processing and export HDFS data back into a database. It is very simple to use. All you need to do is:

    1. Download and configure Sqoop
    2. Create your MySQL table schema
    3. Specify the Hadoop HDFS file name, the result table name and the column separator.

    Read this for more info: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html

    The advantage of using Sqoop is that we can move our HDFS data into many kinds of stores (MySQL, Derby, Hive, etc.) and back with a single command.

    For your use case, make the necessary changes to the following:

    mapper.py

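    #!/usr/bin/env python

    import sys

    for line in sys.stdin:
        line = line.strip()
        # Only <row .../> elements carry data
        if line.find("<row") != -1:
            words = line.split(' ')
            # Each attribute looks like name="value"; take the text between the quotes
            campaignID = words[1].split('"')[1]
            adGroupID = words[2].split('"')[1]
            # A parenthesized single argument prints the same on Python 2 and 3
            print("%s:%s:" % (campaignID, adGroupID))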
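The mapper above relies on the attributes appearing in a fixed order and being separated by single spaces. As a hedged alternative, here is a minimal sketch that parses each `<row .../>` line with `xml.etree.ElementTree` (already imported in the question's mapper). The helper names `parse_row` and `map_lines` are hypothetical, and the sketch assumes every row sits on a single line, as in test.xml.

```python
import xml.etree.ElementTree as ET


def parse_row(line):
    """Return (campaignID, adGroupID) for a <row .../> line, or None otherwise."""
    line = line.strip()
    if not line.startswith("<row"):
        return None
    try:
        # A self-closing <row .../> is a complete XML element on its own
        row = ET.fromstring(line)
    except ET.ParseError:
        # Skip malformed lines instead of failing the whole map task
        return None
    campaign_id = row.get("campaignID")
    ad_group_id = row.get("adGroupID")
    if campaign_id is None or ad_group_id is None:
        return None
    return campaign_id, ad_group_id


def map_lines(lines, sep=":"):
    """Yield one separator-terminated record per parsable <row/> line."""
    for line in lines:
        fields = parse_row(line)
        if fields is not None:
            # Keep the trailing separator, matching the mapper output format above
            yield sep.join(fields) + sep
```

In a streaming mapper this would be driven by looping over `map_lines(sys.stdin)` and printing each record. Looking attributes up by name means a reordered or extra attribute does not shift the columns.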
    

    streaming command

    bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -file /path/to/mapper.py file -mapper /path/to/mapper.py file -file /path/to/reducer.py file -reducer /path/to/reducer.py file -input /user/input -output /user/output
    

    mysql

    create database test;
    use test;
    create table testtable ( a varchar (100), b varchar(100) );
    

    sqoop

    ./sqoop export --connect jdbc:mysql://localhost/test --username root --table testtable --export-dir /user/output --input-fields-terminated-by ':'
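If the export has to be triggered from a Python script once the streaming job finishes, one possible approach is to assemble the same command as an argument list. This is only a sketch: the function names `build_sqoop_export` and `run_sqoop_export` are hypothetical, it uses only the flags shown in the command above, and it assumes the `sqoop` launcher is reachable at the path you pass in.

```python
import subprocess


def build_sqoop_export(connect, username, table, export_dir, field_sep, sqoop="sqoop"):
    """Build the argument list for a `sqoop export` call like the one above."""
    return [
        sqoop, "export",
        "--connect", connect,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
        "--input-fields-terminated-by", field_sep,
    ]


def run_sqoop_export(args):
    """Run the export and return Sqoop's exit code (0 normally means success)."""
    # Passing a list avoids shell quoting problems with separators such as ':' or a tab
    return subprocess.call(args)
```

For the example in this answer the call would be `build_sqoop_export("jdbc:mysql://localhost/test", "root", "testtable", "/user/output", ":")`, where `testtable` is the table created in the mysql step.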
    

    Notes:

    1. Please change the mapper as per your needs.
    2. I have used ':' as my column separator in both the mapper and the sqoop command. Change it as needed.
    3. Sqoop tutorials: I have personally followed Hadoop: The Definitive Guide (O'Reilly) as well as http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html.
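Because the separator has to agree between the mapper output and the sqoop command, it can help to check a sample of the output file before exporting. The sketch below is an assumption-laden illustration, not part of Sqoop: `split_record` and `bad_lines` are hypothetical helpers, two columns are assumed (matching the two-column table above), and the code assumes the values themselves never contain the separator. It also drops a trailing tab, which a streaming job may append after a key that has an empty value.

```python
def split_record(line, sep=":", expected=2):
    """Split one output line into column values, or raise ValueError."""
    # Drop the line ending and any trailing tab, then the trailing separator
    body = line.rstrip("\r\n\t")
    if body.endswith(sep):
        body = body[: -len(sep)]
    fields = body.split(sep)
    if len(fields) != expected:
        raise ValueError("expected %d fields, got %d: %r" % (expected, len(fields), line))
    return fields


def bad_lines(lines, sep=":", expected=2):
    """Return (line_number, line) pairs that do not split into the expected columns."""
    bad = []
    for number, line in enumerate(lines, 1):
        try:
            split_record(line, sep, expected)
        except ValueError:
            bad.append((number, line))
    return bad
```

Running `bad_lines` over the first few thousand lines of a part file is a cheap way to catch a separator mismatch before the export starts.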
© 2021 The Archive Base. All Rights Reserved