I'm working on a Pig script which performs heavy-duty data processing on raw

Asked: June 17, 2026 (2026-06-17T19:03:45+00:00)

I'm working on a Pig script that performs heavy-duty data processing on raw transactions and comes up with various transaction patterns.

Say one of the patterns is: find all accounts that received cross-border transactions in a day (with the total number of transactions and their amount).

My expected output should be two data files:
1) Rollup data, e.g. account A1 received 50 transactions from country AU.
2) Raw transactions, i.e. all 50 of the above transactions for A1.

My Pig script currently creates output in the following format:

Account  Country  TotalTxns  RawTransactions
A1       AU       50         [(Txn1), (Txn2), (Txn3)….(Txn50)]
A2       JP       30         [(Txn1), (Txn2)….(Txn30)]

The question is: when I export this data from Hadoop (to some DB), I want to link each rollup record (A1, AU, 50) to all 50 of its raw transactions (for example, ID 1 for the rollup record used as a foreign key on all 50 associated transactions).

I understand that Hadoop, being distributed, should not be used for assigning IDs, but are there any options where I can assign non-unique IDs (they need not be sequential), or some other way to link this data?

EDIT (after using Enumerate from DataFu)
Here is the Pig script:

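-- Register DataFu and define Enumerate so that numbering starts at 1
register /UDF/datafu-0.0.8.jar
define Enumerate datafu.pig.bags.Enumerate('1');
-- Load the raw transactions
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
-- Put all rows into a single bag, then append an index to each tuple
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;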

After running this, I get:

ERROR org.apache.pig.tools.pigstats.SimplePigStats – ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
….

1 Answer

Editorial Team · Answer 1 · 2026-06-17T19:03:46+00:00

I often assign random ids in Hadoop jobs. You just need to generate ids with enough random bits that the probability of a collision is sufficiently small (see http://en.wikipedia.org/wiki/Birthday_problem).

As a rule of thumb, I use 3*log(n) random bits (log base 2), where n is the number of ids that need to be generated.
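To make the rule of thumb concrete, here is a minimal Java sketch. It assumes the logarithm is base 2 and uses the standard birthday-problem approximation p ≈ n^2 / 2^(bits+1), which only holds while p is small. The class and method names are made up for illustration.

```java
public final class RandomIdBits {

    // Rule of thumb from the answer, ASSUMING log base 2: 3 * log2(n) bits, rounded up.
    public static int bitsNeeded(long n) {
        if (n < 2) {
            return 1; // a single id cannot collide; keep at least one bit
        }
        double log2n = Math.log(n) / Math.log(2);
        return (int) Math.ceil(3.0 * log2n);
    }

    // Birthday-problem approximation: p ~ n^2 / 2^(bits + 1), capped at 1.
    public static double collisionProbability(long n, int bits) {
        double p = ((double) n * (double) n) / Math.pow(2.0, bits + 1);
        return Math.min(1.0, p);
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // illustrative number of ids
        int bits = bitsNeeded(n);
        System.out.println("bits needed: " + bits);
        System.out.println("approx. collision probability: " + collisionProbability(n, bits));
    }
}
```

With 3*log2(n) bits the approximation works out to about 1/(2n), so the chance of any collision shrinks as the number of ids grows.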

In many cases Java's UUID.randomUUID(), which carries 122 random bits, will be sufficient.

http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
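As a sketch of how this could answer the linking question, the Java snippet below generates one random UUID per rollup record and stamps it on every raw transaction of that rollup, so the UUID can serve as the foreign key once both outputs are loaded into a DB. The RollupLinker class, its Linked holder and the row layouts are hypothetical and exist only for this illustration. Inside a Pig job the same idea would presumably live in a small Java UDF, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

public final class RollupLinker {

    /** Hypothetical holder for the two outputs; not part of any Pig or Hadoop API. */
    public static final class Linked {
        public final String[] rollupRow;      // {rollupId, account, country, totalTxns}
        public final List<String[]> txnRows;  // each row: {rollupId, txn}

        Linked(String[] rollupRow, List<String[]> txnRows) {
            this.rollupRow = rollupRow;
            this.txnRows = txnRows;
        }
    }

    // Generate one random id for the rollup and copy it onto each raw transaction.
    public static Linked link(String account, String country, List<String> txns) {
        String rollupId = UUID.randomUUID().toString();
        String[] rollupRow = {rollupId, account, country, String.valueOf(txns.size())};
        List<String[]> txnRows = new ArrayList<>();
        for (String txn : txns) {
            txnRows.add(new String[] {rollupId, txn});
        }
        return new Linked(rollupRow, txnRows);
    }

    public static void main(String[] args) {
        Linked out = link("A1", "AU", Arrays.asList("Txn1", "Txn2", "Txn3"));
        System.out.println(String.join(",", out.rollupRow));
        for (String[] row : out.txnRows) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Because the id is generated locally for each rollup record, no coordination between tasks is needed. The price is that ids are random rather than sequential.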
