I’m trying to use Hadoop Streaming to run two commands such as gunzip | map_to_old_format.py, but it errors with gzip saying “|.gz not found” or something along those lines (only when run through Hadoop.. if I run on command line, it works fine).
Since I can’t figure out how to gunzip in Python on the fly, I would like to create one shell script that does this command combining for me (e.g. gunzip_and_map_to_old.sh). I tried this with the following, but gzip didn’t like (gzip complains “gzip: stdin: not in gzip format”):
#!/bin/bash
while read data; do
echo $data | gunzip | map_to_old_format.py $2
done
Regarding python gunzip, I tried f = gzip.GzipFile("", "rb", fileobj=sys.stdin) as well as a Wrapper method described here.
I know nothing about Hadoop, but I’m going to guess that
echo $data | gunzipdoesn’t work because$datais a line ofdata, and$databy itself is probably not in the gzip format. Instead of passing it the data line by line, can’t you just do this in the bash script file?You can then call it by passing in the gzip file like this: