Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 9071037
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 16, 20262026-06-16T17:49:15+00:00 2026-06-16T17:49:15+00:00

Currently, I have Pig script running on top of Amazon EMR to load a

  • 0

Currently, I have Pig script running on top of Amazon EMR to load a bunch of files from S3 and then I will do the filter processing and group the data into phone number, so the data will be like (phonenumber:chararray, bag:{mydata:chararray}). Next I will have to store each phone number into different S3 buckets (possibly buckets in different accounts that I have access to). Seems org.apache.pig.piggybank.storage.MultiStorage is the best use at here, but it doesn’t work, as there are 2 problems I am facing:

  1. There are a lot of phone numbers (approximate 20,000), to store each
    phone number into different S3 buckets is very very slow and the
    program is even out of memory.
  2. There is no way for me to look
    up my lookup table to decide where is the buckets to store into.

So I am wondering if anyone can help out? The second one probably can solve by written up my own UDF store function, but for the first one, how to solve it? Thanks.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-16T17:49:16+00:00Added an answer on June 16, 2026 at 5:49 pm

    S3 is limited to 100 buckets per account. More than that, the creation of a bucket is not immediate, as you need to wait for the bucket to be ready.

    However you can have as many objects as you want in a bucket. You can write the phone numbers as different object relatively quick. Especially if you are taking care at the name of your objects: objects in S3 are stored by prefix. If you are giving all your objects that same prefix, S3 will try to put all of them on the same “hot” area, getting less performance. If you choose the prefix to be different (usually simply reverse the id or time), you will improve it significantly.

    You can also take a look at DynamoDB, which is a scalable NoSQL DB in AWS. You can get very high throughput for the time of building your index. You can later export it to S3 as well, using Hive over Elastic MapReduce.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Currently have a drop down menu that is activated on a hover (from display:none
I am calling a UDF written in Java from a Pig script. In the
I'm using Pig in an java application. Currently I have a thread that runs
Currently have this to get a value from the registry in TSQL. However, I
I currently have an entity model with a bunch of deleted items, the state
I currently have a single jQuery script page that I include in my ScriptManager
Currently have a Magento install running which seems to be printing #.UD3vymhSSYV at the
I currently have a uitableview in place. The data is obtained from an sqlite
Currently have : LOAD DATA LOCAL INFILE '/Users/RkG/Desktop/Microblogs.csv' INTO TABLE blogs This is an
Currently I have 3 servers running, 2 remote, Main server running MySQL 5.5.24 and

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.