I’m running the following MapReduce on AWS Elastic MapReduce: ./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE

Asked: June 3, 2026

I’m running the following MapReduce on AWS Elastic MapReduce:

./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE \
  --mapper s3://classify.mysite.com/mapper.py \
  --reducer s3://classify.mysite.com/reducer.py \
  --input s3n://classify.mysite.com/s3_list.txt \
  --output s3://classify.mysite.com/dat_output4/ \
  --cache s3n://classify.mysite.com/classifier.py#classifier.py \
  --cache-archive s3n://classify.mysite.com/policies.tar.gz#policies \
  --bootstrap-action s3://classify.mysite.com/bootstrap.sh \
  --enable-debugging \
  --master-instance-type m1.large --slave-instance-type m1.large --instance-type m1.large

For some reason, the cache file classifier.py does not seem to be cached. I get this error when reducer.py tries to import it:

  File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201204290242_0001/attempt_201204290242_0001_r_000000_0/work/./reducer.py", line 12, in <module>
    from classifier import text_from_html, train_classifiers
ImportError: No module named classifier

classifier.py is most definitely present at s3n://classify.mysite.com/classifier.py. For what it’s worth, the policies archive seems to load in just fine.

1 Answer

Editorial Team · Answer 1 · 2026-06-03T05:45:41+00:00

I don’t know how to fix this problem in EC2, but I’ve seen it before with Python in traditional Hadoop deployments. Hopefully the lesson translates over.

What we need to do is add the directory that reducer.py is in to the Python path, because presumably classifier.py is in there too. For whatever reason, that directory is not on the Python path, so the import fails to find classifier.

import os.path
import sys

# Add the directory that contains reducer.py to the Python path.
# __file__ is the location of this script, including "reducer.py";
# abspath makes it absolute, and dirname strips the file name.
script_dir = os.path.dirname(os.path.abspath(__file__))
# sys.path is the list of directories Python searches for modules.
sys.path.append(script_dir)

from classifier import text_from_html, train_classifiers

Your code might work locally because of the current working directory you run it from. Hadoop might not run the script with the same current working directory that you use.
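To check whether the working directory really is the culprit, a small diagnostic at the top of reducer.py could help. This is only a sketch: describe_environment and log_environment are hypothetical helper names, not part of Hadoop or EMR, and it assumes the streaming task's stderr ends up in the task attempt logs, which is where I would look for the output.

```python
import os
import sys


def describe_environment(script_path, module_name="classifier"):
    """Collect the facts that decide whether `import <module_name>` can succeed.

    Hypothetical helper for debugging; script_path is normally __file__.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    cwd = os.getcwd()
    module_file = module_name + ".py"
    return {
        "cwd": cwd,
        "script_dir": script_dir,
        # Is the module file visible from the current working directory?
        "in_cwd": os.path.exists(os.path.join(cwd, module_file)),
        # Is the module file sitting next to the script?
        "in_script_dir": os.path.exists(os.path.join(script_dir, module_file)),
        # Would Python already search the script's directory?
        "script_dir_on_sys_path": script_dir in sys.path,
    }


def log_environment(script_path, module_name="classifier"):
    """Write the collected facts to stderr, one key=value pair per line."""
    info = describe_environment(script_path, module_name)
    for key in sorted(info):
        sys.stderr.write("%s=%r\n" % (key, info[key]))
    return info
```

Calling log_environment(__file__) before the failing import should show whether classifier.py is missing from both directories (a caching problem) or is present but not on sys.path (a path problem).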
