I’ve split a big binary file into 2 GB chunks and uploaded them to Amazon S3.
Now I want to join them back into one file and process it with my custom
I tried to run:
elastic-mapreduce -j $JOBID -ssh \
"hadoop dfs -cat s3n://bucket/dir/in/* > s3n://bucket/dir/outfile"
but it failed, because -cat writes its output to my local terminal and the redirect to S3 does not work remotely.
How can I do this?
P.S. I also tried running cat as a streaming MR job:
den@aws:~$ elastic-mapreduce --create --stream --input s3n://bucket/dir/in \
--output s3n://bucket/dir/out --mapper /bin/cat --reducer NONE
This job finished successfully. However, I had 3 file parts in dir/in, and now I have 6 parts in dir/out:
part-0000
part-0001
part-0002
part-0003
part-0004
part-0005
And the _SUCCESS file, of course, which is not part of my output.
So how can I join the file that was split earlier?
I’ve found a solution. Maybe not the best one, but it works.
I created an EMR job flow with a bootstrap action
In the bootstrap script, joinfiles.sh, I download my file pieces from S3 using wget and join them using a regular cat a b c > abc.
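The join step described above could look roughly like the sketch below. This is not my actual joinfiles.sh: the function name join_parts and the file names chunk.* and bigfile.bin are hypothetical, and it assumes the pieces were already downloaded (for example with wget) into the current directory.

```shell
#!/bin/sh
# Sketch of the join step in a script like joinfiles.sh.
# All names here (join_parts, chunk.*, bigfile.bin) are hypothetical.

# join_parts OUT PART...
# Concatenate the given parts, in argument order, into OUT.
join_parts() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Example use once the pieces are in the current directory.
# A glob expands in sorted order, so suffixes such as .aa, .ab, .ac
# are joined in sequence:
#   join_parts bigfile.bin chunk.*
```

The order of the arguments is what matters: cat copies bytes as-is, so the joined file matches the original only if the pieces are listed in the order in which they were split.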
After that I added an s3distcp step, which copied the result back to S3 (a sample can be found at: https://stackoverflow.com/a/12302277/658346).
That is all.