When working with terabytes of data on a typical data-filtering problem, is Apache Pig the right choice, or is it better to write custom MapReduce code for the job?
Apache Pig does not serve as a storage layer. Pig is a scripting language that simplifies writing code that runs on Hadoop. A Pig script is compiled into a set of Hadoop MapReduce jobs, which are submitted to the Hadoop cluster and run in the same way as any other MapReduce job.
Storage is handled by Hadoop (HDFS), not by Pig.
To answer your question: no, there is no limit on the size of the input data, as long as the data can be parsed by Pig load functions and is splittable by the Hadoop InputFormats.
Pig scripts are easier and faster to write than standard Java Hadoop jobs, and Pig has a lot of clever optimizations, such as multiquery execution, that can make complex queries run faster.
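To give a feel for how short a filtering job can be, here is a minimal Pig Latin sketch. The input path, output paths, field names, and thresholds are all hypothetical and only serve as an illustration; adjust the schema and load function to your own data.

```pig
-- Hypothetical input: tab-separated records under /data/events
-- (the path and the schema below are assumptions, not taken from the question)
events = LOAD '/data/events' USING PigStorage('\t')
         AS (user_id:chararray, status:int, bytes:long);

-- Two independent filters over the same input relation
errors = FILTER events BY status >= 500;
large  = FILTER events BY bytes > 1048576L;

-- Two outputs from a single script
STORE errors INTO '/out/errors' USING PigStorage('\t');
STORE large  INTO '/out/large'  USING PigStorage('\t');
```

Because both `STORE` statements read from the same `LOAD`, running this as a single script should let multiquery execution share the scan of the input rather than reading it twice. The equivalent hand-written Java MapReduce job would need a mapper class, a driver, and job configuration for the same result.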