I'm using Pig for data preparation, and I've run into a problem that seems easy but that I can't solve.
For example, I have a column of names:
name
------
Alicia
Ana
Benita
Berta
Bertha
How can I add a row number to each name? The result should look like this:
name | id
----------------
Alicia | 1
Ana | 2
Benita | 3
Berta | 4
Bertha | 5
Thank you for reading this question!
Unfortunately, there is no way to enumerate rows in Pig Latin, or at least I couldn't find an easy one. One solution is to implement a separate MapReduce job with a single reduce task that does the actual enumeration. More precisely:
Map phase: assign all rows to the same key.
Single reduce task: receives that single key with an iterator over all rows. Since the reduce task runs on just one physical machine and the reduce function is called just once, a local counter inside the function solves the problem.
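To make the two steps concrete, here is a minimal sketch in plain Python that imitates the job locally rather than running on Hadoop. The names SINGLE_KEY, map_phase, reduce_phase and enumerate_rows are made up for this illustration, and the sort-and-group step is only a stand-in for the shuffle.

```python
from itertools import groupby
from operator import itemgetter

SINGLE_KEY = "all"  # hypothetical constant key; any fixed value would do


def map_phase(rows):
    """Map: emit every row under the same key so one reducer sees them all."""
    for row in rows:
        yield SINGLE_KEY, row


def reduce_phase(key, rows):
    """Reduce: called once for the single key; a local counter numbers the rows."""
    counter = 0
    for row in rows:
        counter += 1
        yield row, counter


def enumerate_rows(rows):
    # Stand-in for the shuffle: group mapper output by key.
    # sorted() is stable, so rows keep their input order within the key.
    mapped = sorted(map_phase(rows), key=itemgetter(0))
    result = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        result.extend(reduce_phase(key, (row for _, row in group)))
    return result


names = ["Alicia", "Ana", "Benita", "Berta", "Bertha"]
numbered = enumerate_rows(names)
for name, row_id in numbered:
    print(f"{name} | {row_id}")
```

Note that this sketch keeps the input order only because everything runs in one process. In a real job the order in which values reach the reducer is not guaranteed unless you sort them, so the ids would be unique but not necessarily alphabetical.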
If the data is too large to process on a single reducer machine, the built-in MapReduce Counters, which the master node aggregates, may be used instead.
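One way to read the counter idea, and this is my interpretation rather than an established recipe, is a two-pass scheme: first count the rows in each partition, then let each task number its own rows starting from the sum of the counts before it. The sketch below imitates that in plain Python with lists standing in for partitions. It does not use the Hadoop Counters API, and all function names are hypothetical.

```python
from itertools import accumulate


def count_rows(partitions):
    """Pass 1: each task reports how many rows it saw (the role a counter plays)."""
    return [len(part) for part in partitions]


def start_offsets(counts):
    """Turn per-partition counts into the number of rows preceding each partition."""
    return [0] + list(accumulate(counts))[:-1]


def number_partitions(partitions):
    """Pass 2: each task numbers its own rows, starting right after its offset."""
    offsets = start_offsets(count_rows(partitions))
    return [
        [(row, offset + i) for i, row in enumerate(part, start=1)]
        for part, offset in zip(partitions, offsets)
    ]


# Three lists stand in for three input splits of the example data.
partitions = [["Alicia", "Ana"], ["Benita"], ["Berta", "Bertha"]]
for part in number_partitions(partitions):
    for name, row_id in part:
        print(f"{name} | {row_id}")
```

The point of the design is that no single task ever has to hold all the rows: only the small list of per-partition counts has to be gathered in one place.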