I have a quick Hadoop Streaming question. If I’m using Python streaming and my mappers/reducers require Python packages that aren’t installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?
If they’re not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zip file, which will be unpacked for you. Here’s a Hadoop 0.17 invocation:
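As a rough sketch of that approach (not the exact command from this answer), the Python below zips a package directory and assembles a streaming command line that ships the scripts and the archive with -file. The function names, the jar location, and every path are placeholders I made up for illustration; only the -input, -output, -mapper, -reducer, and -file options are standard Hadoop Streaming flags.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip package_dir so the package folder sits at the top level of the archive."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.abspath(os.path.join(root, name))
                # Store paths relative to the parent, e.g. "mypkg/util.py".
                zf.write(full, os.path.relpath(full, parent))
    return zip_path


def streaming_command(streaming_jar, input_path, output_path,
                      mapper, reducer, shipped_files):
    """Build the argument list for a Hadoop Streaming job (hypothetical helper)."""
    cmd = [
        "hadoop", "jar", streaming_jar,
        "-input", input_path,
        "-output", output_path,
        # Scripts are referenced by base name because they are shipped with -file.
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
    ]
    for path in [mapper, reducer] + list(shipped_files):
        cmd += ["-file", path]  # one -file option per shipped file
    return cmd
```

You would pass the resulting list to something like subprocess.call on a machine with the hadoop client installed, after replacing the jar path and the HDFS paths with the ones for your cluster.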
However, see this issue for a caveat:
https://issues.apache.org/jira/browse/MAPREDUCE-596
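Separately from whether the archive gets unpacked, Python can import pure-Python packages straight from a zip file that is on sys.path, so a mapper does not strictly need the unpacked directory. A minimal sketch, assuming the archive ends up in the task's working directory under the hypothetical name mypkg.zip and contains a package called mypkg:

```python
import sys


def add_zip_to_path(zip_path):
    """Put a zip archive at the front of sys.path so packages inside it can be imported."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)


# At the top of a mapper or reducer script (names are placeholders):
#     add_zip_to_path("mypkg.zip")
#     import mypkg
```

This only works for pure-Python code; packages with compiled extension modules cannot be imported from a zip this way and would still need to be installed on the task machines.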