I am running nutch on hadoop multi cluster environment.
Hadoop is throwing an error when nutch is being executed using the following command
$ bin/hadoop jar /home/nutch/nutch/runtime/deploy/nutch-1.5.1.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5
Error:
Exception in thread “main” java.io.IOException: Not a file:
hdfs://master:54310/user/nutch/urls/crawldb
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:170)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:515)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
at com.bdc.dod.dashboard.BDCQueryStatsViewer.run(BDCQueryStatsViewer.java:829)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.bdc.dod.dashboard.BDCQueryStatsViewer.main(BDCQueryStatsViewer.java:796)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
I tried with possible ways of solving this and fixed all the issues like setting http.agent.name in /local/conf path etc. And I installed earlier and it was smooth.
Can anybody suggest a solution?
By the way, I followed link for installing and running.
I could solve this issue. when copying files from local file system to HDFS destination filesystem, it used to be like this: bin/hadoop dfs -put ~/nutch/urls urls.
However it should be “bin/hadoop dfs -put ~/nutch/urls/* urls”, here urls/* will allow sub directories.