I have a pipeline that I currently run on a large university computer cluster.


Asked: May 18, 2026


I have a pipeline that I currently run on a large university computer cluster. For publication purposes I’d like to convert it into MapReduce format so that anyone could run it on a Hadoop cluster such as Amazon Web Services (AWS). The pipeline currently consists of a series of Python scripts that wrap different binary executables and manage the input and output using the Python subprocess and tempfile modules. Unfortunately I didn’t write the binary executables, and many of them either don’t take STDIN or don’t emit STDOUT in a ‘usable’ fashion (e.g., they only write it to files). These problems are why I’ve wrapped most of them in Python.

So far I’ve been able to modify my Python code such that I have a mapper and a reducer that I can run on my local machine in the standard ‘test format.’

$ cat data.txt | mapper.py | reducer.py

The mapper formats each line of data the way the binary it wraps wants it, sends the text to the binary using subprocess.Popen (this also allows me to mask a lot of spurious STDOUT), then collects the STDOUT I want and formats it into lines of text appropriate for the reducer.
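For illustration, a stripped-down sketch of that mapper structure is shown here. It is not the actual pipeline code: `run_wrapped` and `map_lines` are hypothetical names, the tab-separated key/value output is an assumed format, and the sketch targets Python 3 rather than the Python 2.6 shown in the traceback.

```python
import sys
from subprocess import Popen, PIPE


def run_wrapped(cli_parts, text):
    """Send text to the wrapped program's STDIN and return its STDOUT as a string."""
    proc = Popen(cli_parts, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    # stderr is captured and dropped here, which hides spurious diagnostics
    out, _err = proc.communicate(text.encode())
    if proc.returncode != 0:
        raise RuntimeError("wrapped program exited with %d" % proc.returncode)
    return out.decode()


def map_lines(lines, cli_parts):
    """Yield one tab-separated key/value line per non-empty input line."""
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue
        result = run_wrapped(cli_parts, record + "\n").strip()
        yield "%s\t%s" % (record, result)
```

In a real mapper script the generator would be driven by STDIN, for example `for out_line in map_lines(sys.stdin, ["./binary"]): print(out_line)`.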
The problems arise when I try to replicate the command on a local Hadoop install. I can get the mapper to execute, but it gives an error that suggests that it can’t find the binary executable.

File "/Users/me/Desktop/hadoop-0.21.0/./phyml.py", line 69, in <module>
    main()
File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 66, in main
    phyml(None)
File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 46, in phyml
    ft = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 621, in __init__
    errread, errwrite)
File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 1126, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

My hadoop command looks like the following:

./bin/hadoop jar /Users/me/Desktop/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-input /Users/me/Desktop/Code/AWS/temp/data.txt \
-output /Users/me/Desktop/aws_test \
-mapper  mapper.py \
-reducer  reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/mapper.py \
-file /Users/me/Desktop/Code/AWS/temp/reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/binary

As I noted above, it looks to me like the mapper isn’t aware of the binary – perhaps it’s not being sent to the compute node? Unfortunately I can’t really tell what the problem is. Any help would be greatly appreciated. It would be particularly nice to see some Hadoop streaming mappers/reducers written in Python that wrap binary executables. I can’t imagine I’m the first one to try to do this! In fact, here is another post asking essentially the same question, but it hasn’t been answered yet…

Hadoop/Elastic Map Reduce with binary executable?


1 Answer

Editorial Team · Answer 1 · 2026-05-18T06:29:53+00:00

After much googling (etc.) I figured out how to include executable binaries/scripts/modules that are accessible to your mappers/reducers. The trick is to upload all your files to Hadoop (HDFS) first.

$ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py

Then you need to format your streaming command like the following template:

$ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-file /local/file/system/data/data.txt \
-file /local/file/system/mapper.py \
-file /local/file/system/reducer.py \
-cacheFile hdfs://localhost:9000/user/you/module.py#module.py \
-input data.txt \
-output output/ \
-mapper mapper.py \
-reducer reducer.py \
-verbose
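If you launch jobs from a script, the same template can be assembled programmatically. The following is only a sketch: `build_streaming_command` and its parameters are hypothetical names, and it uses nothing beyond the flags that appear in the template above.

```python
def build_streaming_command(hadoop_bin, streaming_jar, local_files, cache_files,
                            input_path, output_path, mapper, reducer):
    """Return the streaming command as an argument list (nothing is executed)."""
    cmd = [hadoop_bin, "jar", streaming_jar]
    # Files shipped from the local file system with the job
    for path in local_files:
        cmd += ["-file", path]
    # Files already uploaded to HDFS, given as (uri, link_name) pairs;
    # link_name is the name the task sees in its working directory
    for uri, link_name in cache_files:
        cmd += ["-cacheFile", "%s#%s" % (uri, link_name)]
    cmd += ["-input", input_path, "-output", output_path,
            "-mapper", mapper, "-reducer", reducer, "-verbose"]
    return cmd
```

The resulting list can be passed to `subprocess.call`, which avoids shell quoting problems with paths that contain spaces.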

If you’re linking a python module you’ll need to add the following code to your mapper/reducer scripts:

import sys 
sys.path.append('.')
import module

If you’re accessing a binary via subprocess, your command should look something like this:

import shlex
from subprocess import Popen, PIPE

cli = "./binary %s" % (argument)
cli_parts = shlex.split(cli)
mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
out = mp.communicate()[0]  # captured STDOUT of the binary
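One more thing worth checking: the `OSError: [Errno 13] Permission denied` in the question usually means the file was found but is not executable, rather than missing. Assuming that is the cause here (I have not verified that Hadoop drops the execute bit when shipping files), a defensive sketch is to set the bit before calling the binary. `ensure_executable` and `run_binary` are hypothetical helper names, and the `chmod` can itself fail if the task user does not own the file.

```python
import os
import stat
from subprocess import Popen, PIPE


def ensure_executable(path):
    """Add the owner-execute bit if the file does not already have it."""
    mode = os.stat(path).st_mode
    if not mode & stat.S_IXUSR:
        os.chmod(path, mode | stat.S_IXUSR)


def run_binary(path, args=(), stdin_data=b""):
    """Run a shipped binary and return (returncode, stdout, stderr) as bytes."""
    ensure_executable(path)
    proc = Popen([path] + list(args), stdin=PIPE, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate(stdin_data)
    return proc.returncode, out, err
```

Returning the exit code and STDERR along with STDOUT makes it easier to see in the task logs why a wrapped binary failed.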

Hope this helps.
