I’m running the following MapReduce on AWS Elastic MapReduce:
./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE --mapper
s3://classify.mysite.com/mapper.py --reducer
s3://classify.mysite.com/reducer.py --input
s3n://classify.mysite.com/s3_list.txt --output
s3://classify.mysite.com/dat_output4/ --cache
s3n://classify.mysite.com/classifier.py#classifier.py --cache-archive
s3n://classify.mysite.com/policies.tar.gz#policies --bootstrap-action
s3://classify.mysite.com/bootstrap.sh --enable-debugging
--master-instance-type m1.large --slave-instance-type m1.large --instance-type m1.large
For some reason, the cache file classifier.py does not seem to be cached. I get this error when reducer.py tries to import it:
File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201204290242_0001/attempt_201204290242_0001_r_000000_0/work/./reducer.py", line 12, in <module>
from classifier import text_from_html, train_classifiers
ImportError: No module named classifier
classifier.py is most definitely present at s3n://classify.mysite.com/classifier.py. For what it’s worth, the policies archive seems to load in just fine.
I don’t know how to fix this problem in EC2, but I’ve seen it before with Python in traditional Hadoop deployments. Hopefully the lesson translates over.
What we need to do is add the directory that reducer.py is in to the Python path, because presumably classifier.py is in there too. For whatever reason, that directory is not on the Python path, so the import fails to find classifier.
Your code might work locally because of the current working directory you run it from. Hadoop might not run it from the same working directory.
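A minimal sketch of that idea, assuming classifier.py really does end up next to reducer.py in the task's working directory. The helper name add_script_dir_to_path is hypothetical and only for illustration. It uses os.path.abspath, which normalizes the path without following symlinks, so the directory you get is the one the script was launched from.

```python
import os
import sys


def add_script_dir_to_path(script_file):
    """Put the directory containing script_file at the front of sys.path.

    Hypothetical helper for illustration; returns the directory it added.
    """
    # abspath does not resolve symlinks, so this is the launch directory
    script_dir = os.path.dirname(os.path.abspath(script_file))
    # Drop any existing entry so the directory is not listed twice
    if script_dir in sys.path:
        sys.path.remove(script_dir)
    sys.path.insert(0, script_dir)
    return script_dir


# Intended use at the very top of reducer.py, before the failing import:
#
#     add_script_dir_to_path(__file__)
#     from classifier import text_from_html, train_classifiers
```

If the import still fails after this, printing os.getcwd(), os.listdir("."), and sys.path to stderr from inside the reducer should show where Hadoop actually placed the cached file.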