I am trying include a python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument, “-file”.
Edit: One solution would be to install this package on all the slaves, but I don’t have that option currently.
I would zip up the package into a
.tar.gzor a.zipand pass the entire tarball or archive in a-fileoption to your hadoop command. I’ve done this in the past with Perl but not Python.That said, I would think this would still work for you if you use Python’s
zipimportat http://docs.python.org/library/zipimport.html, which allows you to import modules directly from a zip.