Is it possible to (efficiently) select a random tuple from a bag in pig?
I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection.
One (not efficient) solution is counting the number of tuples in the bag, take a random number within that range, loop through the bag, and stop whenever the number of iterations matches my random number. Does anyone know of faster/better ways to do this?
Is it possible to (efficiently) select a random tuple from a bag in pig?
Share
You could use RANDOM(), ORDER and LIMIT in a nested FOREACH statement to select one element with the smallest random number:
dump randoms;
INPUT:
OUTPUT:
If you run “dump randoms;” multiple times, you should get different results for each run.
Writing a UDF might give you better performance as you do not need to do secondary sort on random within the bag.