I’ve seen a technique (on stackoverflow) for executing a hadoop streaming job using zip files to store referenced python modules.
I’m having some errors during the mapping phase of my job’s execution. I’m fairly certain it’s related to the zip’d module loading.
To debug the script, I have run my data set through sys.stdin/sys.stdout using command line pipes into my mapper and reducer so something like this:
head inputdatafile.txt | ./mapper.py | sort -k1,1 | ./reducer.py
the results look great.
When I run this through hadoop though, I start hitting some problems. ie: the mapper and reducer fail and the entire hadoop job fails completely.
My zip’d module file contains *.pyc files – is that going to impact this thing?
Also where can I find the errors generated during the map/reduction process using hadoop streaming?
I’ve used the -file command line argument to tell hadoop where the zip’d module is located and where my mapper and reducer scripts are located.
i’m not doing any crazy configuration options to increase the number of mappers and reducers used in the job.
any help would be greatly appreciated! thanks!
After reviewing sent_tokenize’s source code, it looks like the nltk.sent_tokenize AND the nltk.tokenize.sent_tokenize methods/functions rely on a pickle file (one used to do punkt tokenization) to operate.
Since this is Hadoop-streaming, you’d have to figure out where/how to place that pickle file into the zip’d code module that is added into the hadoop job’s jar.
Bottom line? I recommend using the RegexpTokenizer class to do sentence and word level tokenization.