The Archive Base

Editorial Team
Asked: May 18, 2026 (2026-05-18T06:29:53+00:00)

I have a pipeline that I currently run on a large university computer cluster.

I have a pipeline that I currently run on a large university computer cluster. For publication purposes I'd like to convert it into MapReduce format so that anyone could run it on a Hadoop cluster such as Amazon Web Services (AWS). The pipeline currently consists of a series of Python scripts that wrap different binary executables and manage the input and output using the Python subprocess and tempfile modules. Unfortunately I didn't write the binary executables, and many of them either don't take STDIN or don't emit STDOUT in a 'usable' fashion (e.g., they only write to files). These problems are why I've wrapped most of them in Python.

So far I’ve been able to modify my Python code such that I have a mapper and a reducer that I can run on my local machine in the standard ‘test format.’

$ cat data.txt | mapper.py | reducer.py
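A caveat on this local test, offered as a hedged aside: Hadoop streaming sorts the mapper output by key before it reaches the reducer, so the plain pipe only approximates a real run (adding `sort` between the two stages gets closer). The sketch below simulates that map, sort, reduce flow in Python. `toy_mapper` and `toy_reducer` are hypothetical stand-ins, not the pipeline's real scripts, and tab-separated key/value lines are assumed.

```python
from itertools import groupby


def key_of(kv_line):
    """Return the key part of a tab-separated key/value line."""
    return kv_line.split("\t", 1)[0]


def simulate_streaming(lines, mapper, reducer):
    """Run mapper, sort by key, then hand each key group to reducer."""
    mapped = [out for line in lines for out in mapper(line)]
    mapped.sort(key=key_of)  # stands in for Hadoop's shuffle/sort step
    results = []
    for key, group in groupby(mapped, key=key_of):
        values = [kv.split("\t", 1)[1] for kv in group]
        results.extend(reducer(key, values))
    return results


# Hypothetical stand-ins for the real mapper and reducer logic.
def toy_mapper(line):
    for word in line.split():
        yield "%s\t1" % word


def toy_reducer(key, values):
    yield "%s\t%d" % (key, sum(int(v) for v in values))
```

A reducer that passes this kind of check with grouped, sorted input is more likely to behave the same way under Hadoop than one tested only against unsorted mapper output.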

The mapper formats each line of data the way the binary it wraps wants it, sends the text to the binary using subprocess.Popen (this also allows me to mask a lot of spurious STDOUT), then collects the STDOUT I want and formats it into lines of text appropriate for the reducer.
The problems arise when I try to replicate the command on a local Hadoop install. I can get the mapper to execute, but it gives an error that suggests it can't find the binary executable.

  File "/Users/me/Desktop/hadoop-0.21.0/./phyml.py", line 69, in <module>
    main()
  File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 66, in main
    phyml(None)
  File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 46, in phyml
    ft = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
  File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 621, in __init__
    errread, errwrite)
  File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 1126, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied
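A note on reading this error: Errno 13 is EACCES (permission denied), which is a different failure from ENOENT (Errno 2, no such file). One possible reading is that the file was found but could not be executed, for example because its execute bit was not set in the task's working directory. The helper below is a minimal diagnostic sketch for telling these cases apart; the function name is hypothetical and the messages are only illustrative.

```python
import os


def describe_launch_problem(path):
    """Return a short diagnosis of why Popen might fail to start path."""
    if not os.path.exists(path):
        # Popen would fail with ENOENT (no such file or directory).
        return "missing: %s" % path
    if os.path.isdir(path):
        return "directory, not a file: %s" % path
    if not os.access(path, os.X_OK):
        # Popen would fail with EACCES (permission denied).
        return "not executable: %s" % path
    return "ok: %s" % path
```

Writing the result to sys.stderr from inside the mapper, just before the Popen call, would put the diagnosis in the task's error log.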

My hadoop command looks like the following:

./bin/hadoop jar /Users/me/Desktop/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-input /Users/me/Desktop/Code/AWS/temp/data.txt \
-output /Users/me/Desktop/aws_test \
-mapper  mapper.py \
-reducer  reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/mapper.py \
-file /Users/me/Desktop/Code/AWS/temp/reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/binary

As I noted above, it looks to me like the mapper isn't aware of the binary. Perhaps it's not being sent to the compute node? Unfortunately I can't really tell what the problem is. Any help would be greatly appreciated. It would be particularly nice to see some Hadoop streaming mappers/reducers written in Python that wrap binary executables. I can't imagine I'm the first one to try to do this! In fact, here is another post asking essentially the same question, but it hasn't been answered yet:

Hadoop/Elastic Map Reduce with binary executable?

1 Answer

  1. Editorial Team
    Added an answer on May 18, 2026 at 6:29 am

    After much googling, I figured out how to make executable binaries, scripts, and modules accessible to your mappers/reducers. The trick is to upload all your files to Hadoop (HDFS) first.

    $ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py
    

    Then you need to format your streaming command like the following template:

    $ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
    -file /local/file/system/data/data.txt \
    -file /local/file/system/mapper.py \
    -file /local/file/system/reducer.py \
    -cacheFile hdfs://localhost:9000/user/you/module.py#module.py \
    -input data.txt \
    -output output/ \
    -mapper mapper.py \
    -reducer reducer.py \
    -verbose
    

    If you’re linking a python module you’ll need to add the following code to your mapper/reducer scripts:

    import sys 
    sys.path.append('.')
    import module
    

    If you're accessing a binary via subprocess, your command should look something like this:

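    import shlex
    from subprocess import Popen, PIPE

    # "./" points at the task's working directory, where the shipped binary appears
    cli = "./binary %s" % (argument)  # argument: whatever the binary expects
    cli_parts = shlex.split(cli)
    mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
    output = mp.communicate()[0]  # captured STDOUT of the binary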
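Putting these pieces together, a streaming mapper that wraps a binary might look roughly like the sketch below. It is an illustration under several assumptions rather than the original pipeline's code: the binary is assumed to read one record on STDIN and write its result to STDOUT, the output is keyed by the input record, and `ensure_executable`, `run_binary`, and `map_lines` are hypothetical helper names. It is written for Python 3 (the traceback in the question comes from Python 2.6). The chmod step is one possible guard against the "Permission denied" error and only works if the task user is allowed to change the file's mode.

```python
import os
import stat
import sys
from subprocess import Popen, PIPE


def ensure_executable(path):
    """Add the owner-execute bit if the file lacks it."""
    mode = os.stat(path).st_mode
    if not mode & stat.S_IXUSR:
        os.chmod(path, mode | stat.S_IXUSR)


def run_binary(cli_parts, text):
    """Send text to the wrapped program on STDIN and return its STDOUT."""
    proc = Popen(cli_parts, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate(text.encode("utf-8"))
    if proc.returncode != 0:
        # Surface the failure in the task's stderr log instead of hiding it.
        sys.stderr.write(err.decode("utf-8", "replace"))
        raise RuntimeError("wrapped binary exited with %d" % proc.returncode)
    return out.decode("utf-8")


def map_lines(lines, cli_parts):
    """Yield one tab-separated key/value line per non-empty input record."""
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue
        result = run_binary(cli_parts, record).strip()
        yield "%s\t%s" % (record, result)


def main():
    # "./binary" is the placeholder name used in the snippet above.
    ensure_executable("./binary")
    for out_line in map_lines(sys.stdin, ["./binary"]):
        print(out_line)

# In a real mapper.py, finish the script with a call to main().
```

Keeping the subprocess call in its own function makes it possible to test the mapper locally against any stand-in command before submitting the job.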
    

    Hope this helps.

© 2021 The Archive Base. All Rights Reserved