I wrote a UDF to load files into Pig. It works well for text files, but now I also need to read .gz files. I know I can unzip the file and then process it, but I want to read the .gz file directly, without unzipping it first.
My UDF extends LoadFunc, and my custom input format, MyInputFile, extends TextInputFormat. I also implemented MyRecordReader. Is extending TextInputFormat the problem? I tried FileInputFormat, but I still cannot read the file. Has anyone written a UDF that reads data from a .gz file before?
TextInputFormat handles gzip files as well. Have a look at the initialize() method of its RecordReader (LineRecordReader), where the proper CompressionCodec is initialized. Also note that gzip files aren't splittable (even if they are located on S3), so you might need to use either a splittable format (e.g. LZO) or uncompressed data to get the desired level of parallel processing.
If your gzipped data is stored locally, you can uncompress it and copy it to HDFS in one step, as described here. Or, if it's already on HDFS,
hadoop fs -cat /data/data.gz | gzip -d | hadoop fs -put - /data/data.txt
would be more convenient.
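To illustrate the point about codec selection: as far as I know, LineRecordReader picks the codec from the file name suffix, so a custom RecordReader that opens the raw stream itself loses that step. The sketch below is not Hadoop code; it is a plain-JDK analogy that uses java.util.zip.GZIPInputStream instead of a CompressionCodec, and the class name GzAwareLineReader and its methods are hypothetical names chosen for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: a plain-JDK analogy of suffix-based codec selection.
public class GzAwareLineReader {

    // Opens the file and wraps the stream in a gzip decompressor
    // when the file name ends with ".gz"; otherwise returns it as-is.
    static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (file.getFileName().toString().endsWith(".gz")) {
            try {
                return new GZIPInputStream(raw);
            } catch (IOException e) {
                raw.close(); // do not leak the handle if the gzip header is invalid
                throw e;
            }
        }
        return raw;
    }

    // Reads all lines, decompressing on the fly; nothing is unzipped to disk.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

In a custom RecordReader the equivalent step would be to wrap the input stream with the matching codec before reading lines from it, or simply to delegate to LineRecordReader, which already does this.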