I am trying to figure out how to set a classpath that references HDFS. I cannot find any documentation on this.
java -cp "how to reference to HDFS?" com.MyProgram
If I cannot reference the Hadoop file system, then I have to copy all the referenced third-party libs/jars somewhere under $HADOOP_HOME on each Hadoop machine, but I want to avoid this by putting the files on the Hadoop file system. Is this possible?
Example hadoop command line for the program to run (this is what I expect it to look like; I may be wrong):
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input inputfileDir -output outputfileDir -mapper /home/nanshi/myprog.java -reducer NONE -file /home/nanshi/myprog.java
However, within the command line above, how do I add the Java classpath? For example:
-cp "/home/nanshi/wiki/Lucene/lib/lucene-core-3.6.0.jar:/home/nanshi/Lucene/bin"
You cannot add an HDFS path to your classpath. The java executable wouldn't be able to interpret an hdfs:// URI as a classpath entry.
However, you can add third-party libraries to the classpath of each task that needs them by using the -libjars option. This means you need a so-called driver class (implementing Tool) that sets up and starts your job, and you pass the -libjars option on the command line when running that driver class.
The Tool, when launched through ToolRunner, relies on GenericOptionsParser to parse your command-line arguments (including -libjars) and, with the help of the JobClient, does all the necessary work to send your libraries to the machines that need them and to put them on the classpath of those machines.
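One detail that is easy to trip over: -libjars takes a comma-separated list of jars, whereas java -cp takes a colon-separated one. The sketch below converts one form to the other and prints the resulting command rather than running it. The names myjob.jar, com.MyDriver and the second jar path are hypothetical placeholders, not taken from the question; only the Lucene jar path is.

```shell
#!/bin/sh
# Colon-separated list, as you would pass it to java -cp.
# The first path comes from the question; the second is a made-up example.
JARS_COLON="/home/nanshi/wiki/Lucene/lib/lucene-core-3.6.0.jar:/home/nanshi/lib/other.jar"

# -libjars expects commas between jars, so swap the separators.
LIBJARS=$(printf '%s' "$JARS_COLON" | tr ':' ',')

# Generic options such as -libjars go after the driver class name and
# before the job's own arguments. myjob.jar and com.MyDriver are
# hypothetical names for your job jar and your Tool driver class.
CMD="hadoop jar myjob.jar com.MyDriver -libjars $LIBJARS inputfileDir outputfileDir"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

Note that the classpath in the question also contains a directory of classes (/home/nanshi/Lucene/bin). Since -libjars is meant for jar files, I would assume those classes need to be packaged into a jar first.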
Besides that, to run a MapReduce job you should use the hadoop script located in the bin/ directory of your distribution.
Here is an example (using a jar containing your job and the driver class):