I have a Pig job which analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.
Some notes:
- This job is running on Amazon Elastic MapReduce.
- I can use a STREAM to pipe the data through an external command, and load it from there. But because Pig never sends an EOF to external commands, this means I need to POST each row as it arrives, and I can’t batch them. Obviously, this hurts performance.
What’s the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter? Thank you for your advice!
As it turns out, Pig does correctly send EOF to external commands, so you do have the option of streaming everything through an external script. If it isn’t working, then you probably have a hard-to-debug configuration problem.
Here’s how to get started. Define an external command with Pig’s `DEFINE` statement, using whatever interpreter and script you need.
Then stream the results through your script with `STREAM ... THROUGH`.
From Ruby, you can batch the input records into blocks of 1024:
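A minimal sketch of such a script follows. It is not the original code. It assumes that Pig streams each record as one tab-delimited line (the default), that the endpoint accepts a JSON array of rows, and that the URL is passed as the script's first argument. The names `each_batch` and `post_batch` are hypothetical helpers introduced for illustration.

```ruby
require 'json'
require 'net/http'
require 'uri'

BATCH_SIZE = 1024

# Read lines from io and yield arrays of up to `size` rows.
# Each row is the list of tab-separated fields from one line.
def each_batch(io, size = BATCH_SIZE)
  io.each_line.each_slice(size) do |lines|
    yield lines.map { |line| line.chomp.split("\t", -1) }
  end
end

# POST one batch to the endpoint as a JSON array of rows.
# Raises if the server does not answer with a 2xx status.
def post_batch(uri, rows)
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(rows)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    response = http.request(request)
    raise "POST failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  end
end

# Only run when invoked as a script with a URL argument (an assumption
# about how the command is defined on the Pig side).
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  uri = URI(ARGV[0])
  each_batch($stdin) { |rows| post_batch(uri, rows) }
end
```

Because `each_slice` emits its final, possibly short, batch only when the input ends, this approach relies on Pig sending EOF as described above. The script writes nothing to standard output, so the relation produced by the stream on the Pig side will be empty.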
If this fails to work, check the following carefully:
Some of these are hard to verify, but if any of them are failing, you can easily waste quite a lot of time debugging.
Note, however, that you should strongly consider the alternative approaches recommended by mat kelcey.