I am using Hadoop to process an XML file, so I have written a mapper and a reducer in Python.
Suppose the input to be processed is test.xml:
<report>
<report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
<date-range date="All Time"/>
<table>
<columns>
<column name="campaignID" display="Campaign ID"/>
<column name="adGroupID" display="Ad group ID"/>
</columns>
<row campaignID="79057390" adGroupID="3451305670"/>
<row campaignID="79057390" adGroupID="3451305670"/>
</table>
</report>
mapper.py file
import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row") != -1:
            .............
            .............
            .............
            print '%s\t%s' % (campaignID, adGroupID)
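For illustration only, the elided row-handling step could look something like the sketch below. It is not the original code: it assumes each `<row .../>` element sits complete on a single line (as in the test.xml sample) and that only the `campaignID` and `adGroupID` attributes are needed. The function names `row_to_tsv` and `run_mapper` are made up for this example, and it is written to run on Python 3 rather than the Python 2 used above.

```python
import xml.etree.ElementTree as ET


def row_to_tsv(line):
    """Return 'campaignID<TAB>adGroupID' for a one-line <row .../> element, else None."""
    line = line.strip()
    if not line.startswith("<row"):
        return None
    try:
        element = ET.fromstring(line)
    except ET.ParseError:
        # The line is not a complete element on its own (assumption violated).
        return None
    if element.tag != "row":
        return None
    campaign_id = element.get("campaignID")
    ad_group_id = element.get("adGroupID")
    if campaign_id is None or ad_group_id is None:
        return None
    return "%s\t%s" % (campaign_id, ad_group_id)


def run_mapper(in_stream, out_stream):
    """Read XML lines from in_stream and write one tab-separated record per row."""
    for raw_line in in_stream:
        record = row_to_tsv(raw_line)
        if record is not None:
            out_stream.write(record + "\n")
```

In a streaming mapper this would be driven with `run_mapper(sys.stdin, sys.stdout)` after `import sys`.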
reducer.py file
import sys

if __name__ == '__main__':
    for line in sys.stdin:
        print line.strip()
I ran Hadoop with the following command:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml \
    -output /path/to/output_folder/to/store/file
When I run the above command, Hadoop correctly creates an output file at the output path, in the format specified in reducer.py and with the required data.
What I am trying to do now is avoid storing the output in the text file that Hadoop creates by default when I run the above command. Instead, I want to save the data into a MySQL database.
So I wrote some Python code in reducer.py that writes the data directly to the MySQL database, and tried to run the above command with the output path removed, as below:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml
I then get an error like the one below:
12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
.........................
.........................
My questions are:
- How can I save the data to a database after processing the files?
- In which file (mapper.py or reducer.py) should the code that writes the data to the database go?
- Which command should be used to run Hadoop so that the data is saved to the database? When I removed the output folder path from the hadoop command, it showed an error.
Can anyone please help me solve this problem?
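As a rough sketch of what database-writing code in reducer.py might look like, the helper below batches tab-separated records and inserts them with the DB-API `executemany` call instead of one statement per row. Everything named here is an assumption for illustration: the helpers `parse_records` and `insert_in_batches`, the table name `performance_report`, and the batch size. Note that placeholder syntax depends on the driver (MySQLdb uses `%s`, while the standard-library sqlite3 module uses `?`), so the INSERT statement is passed in as a parameter. As the error output above shows, the streaming job still requires `-output` even if the reducer writes elsewhere.

```python
def parse_records(lines):
    """Yield (campaignID, adGroupID) tuples from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip malformed records instead of failing the whole task
        yield tuple(parts)


def insert_in_batches(conn, records, insert_sql, batch_size=1000):
    """Insert records through a DB-API connection, committing once per batch.

    Returns the number of records handed to the database.
    """
    cursor = conn.cursor()
    batch = []
    total = 0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            cursor.executemany(insert_sql, batch)
            conn.commit()
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        cursor.executemany(insert_sql, batch)
        conn.commit()
        total += len(batch)
    return total
```

With a MySQL driver, the call might look like `insert_in_batches(conn, parse_records(sys.stdin), "INSERT INTO performance_report (campaignID, adGroupID) VALUES (%s, %s)")`, where `conn` is a connection opened by that driver.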
Edited
Process followed
- Created mapper and reducer files as above that read the XML file and create a text file in some folder via the hadoop command. For example, the text file (the result of processing the XML file with the hadoop command) is at:
/home/local/user/Hadoop/xml_processing/xml_output/part-00000
Here the XML file is 1.3 GB, and after processing with Hadoop the resulting text file is 345 MB.
What I want to do now is read the text file at the above path and save the data to the MySQL database as fast as possible.
I have tried this with basic Python, but it takes about 350 seconds to process the text file and save it to the MySQL database.
- As indicated by nichole, I downloaded Sqoop and unzipped it at a path like the one below:
/home/local/user/sqoop-1.4.2.bin__hadoop-0.20
I then went into the bin folder and typed ./sqoop, and received the output below:
sh-4.2$ ./sqoop
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Try 'sqoop help' for usage.
I have also tried the following:
./sqoop export --connect jdbc:mysql://localhost/Xml_Data --username root --table PerformaceReport --export-dir /home/local/user/Hadoop/xml_processing/xml_output/part-00000 --input-fields-terminated-by '\t'
Result
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
12/11/27 11:54:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/11/27 11:54:57 INFO tool.CodeGenTool: Beginning code generation
12/11/27 11:54:57 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:636)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:525)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:548)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:262)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
Is the above sqoop command suitable for reading the text file and saving it into the database? We have to read from the text file and insert into the database.
I code all my Hadoop MR jobs in Python. Let me just say that you need not use Python for moving data. Use Sqoop: http://sqoop.apache.org/
Sqoop is an open-source tool that lets users move data between a relational database and Hadoop: importing it into Hadoop for further processing, and exporting results back out. It's very simple to use. All you need to do is
Read this for more info: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html
The advantage of using Sqoop is that we can move our HDFS data into many kinds of stores (MySQL, Derby, Hive, etc.) and vice versa with a single-line command.
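As a small sketch of that single-line export, the helper below assembles a `sqoop export` argument list using only the flags that appear in the question's own command. The function name `build_sqoop_export` and its parameters are invented for this example. Two caveats, stated as assumptions about the asker's setup: `--export-dir` is expected to name a directory in HDFS rather than on the local filesystem, and the "Could not load db driver class" error suggests the MySQL JDBC driver jar is not on Sqoop's classpath (it is usually placed in Sqoop's lib directory).

```python
def build_sqoop_export(connect, username, table, export_dir,
                       field_sep=r"\t", sqoop_bin="sqoop"):
    """Build the argument list for a `sqoop export` run.

    field_sep defaults to the two-character escape \\t, mirroring the
    quoted '\\t' passed on the shell command line in the question.
    """
    return [
        sqoop_bin, "export",
        "--connect", connect,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
        "--input-fields-terminated-by", field_sep,
    ]
```

The resulting list can be handed to `subprocess.call(...)`, which avoids shell quoting problems with the separator argument.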
For your use case, please make the necessary changes:
mapper.py
streaming command
mysql
sqoop
Note :