Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 9184483
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 17, 20262026-06-17T19:03:45+00:00 2026-06-17T19:03:45+00:00

I m working on PIG script which performs heavy duty data processing on raw

  • 0

I m working on PIG script which performs heavy duty data processing on raw transactions and come up with various transaction patterns.

Say one of pattern is – find all accounts who received cross border transactions in a day (with total transaction and amount of transactions).

My expected output should be two data files
1) Rollup data – like account A1 received 50 transactions from country AU.
2) Raw transactions – all above 50 transactions for A1.

My PIG script is currently creating output data source in following format

Account Country TotalTxns RawTransactions

A1 AU 50 [(Txn1), (Txn2), (Txn3)….(Txn50)]

A2 JP 30 [(Txn1), (Txn2)….(Txn30)]

Now question here is, when I get this data out of Hadoop system (to some DB) I want to establish link between my rollup record (A1, AU, 50) with all 50 raw transactions (like ID 1 for rollup record used as foreign key for all 50 associated Txns).

I understand Hadoop being distributed should not be used for assigning IDs, but are there any options where i can assign non-unique Ids (no need to be sequential) or some other way to link this data?

EDIT (after using Enumerate from DataFu)
here is the PIG script

register /UDF/datafu-0.0.8.jar
define Enumerate datafu.pig.bags.Enumerate('1');
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;

after running this, I am getting

ERROR org.apache.pig.tools.pigstats.SimplePigStats – ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
….

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-17T19:03:46+00:00Added an answer on June 17, 2026 at 7:03 pm

    I often assign random ids in Hadoop jobs. You just need to ensure you generate ids which contain a sufficient number of random bits to ensure the probability of collisions is sufficiently small (http://en.wikipedia.org/wiki/Birthday_problem).

    As a rule of thumb I use 3*log(n) random bits where n = # of ids that need to be generated.

    In many cases Java’s UUID.randomUUID() will be sufficient.

    http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm working on a Pig script (my first) that loads a large text file.
I am working on a pig latin translator which translator the given word into
I'm working with Pig and I'm trying to store my results in a MySQL
I'm working on creating a pig latin translator in ruby. It works for most
I'm analyzing data with Apache pig and could not find a way to expand
When working with TeraBytes of data, and for a typical data filtering problem, is
Working on a little Ruby script that goes out to the web and crawls
Working on a PHP /html content management system and found Free RTE online which
Working with SQL server with isolation level read committed snapshot, we routinely write data
Working on various GNU Readline-based CLIs and it would dramatically speed me up if

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.