The Archive Base

Editorial Team
Asked: June 17, 2026

I have a large data set in the following format:

In total, there are 3687 object files, each of which contains 2,000,000 records. Each file is 42MB in size.

Each record contains the following:

  • An id (Integer value)
  • Value1 (Integer)
  • Value2 (Integer)
  • Value3 (Integer)

The contents of each file are not sorted or ordered in any way, since the records are written as they are observed during data collection.

Ideally, I want to build an index for this data (indexed by the id), which would mean the following:

  1. Divide the set of ids into manageable chunks.

  2. Scan the files to collect the data for the current working set of ids.

  3. Build the index for that chunk.

  4. Move on to the next chunk and repeat steps 2 and 3.
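
The steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: each "file" is an in-memory list of `int[4]` records standing in for a decoded object file, the chunk width is a made-up value, and the sketch assumes the same id may appear in more than one record. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ChunkedIndexSketch {

    /**
     * Builds the index for ids in [lo, hi) by scanning every file once.
     * Each record is an int[4] laid out as {id, value1, value2, value3}.
     */
    static Map<Integer, List<int[]>> buildChunk(List<List<int[]>> files, int lo, int hi) {
        Map<Integer, List<int[]>> index = new TreeMap<>();
        for (List<int[]> file : files) {
            for (int[] rec : file) {
                int id = rec[0];
                if (id >= lo && id < hi) {
                    // Assumption: an id can occur several times, so keep a list per id.
                    index.computeIfAbsent(id, k -> new ArrayList<>()).add(rec);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Two tiny stand-in "files"; real files would hold 2,000,000 records each.
        List<List<int[]>> files = List.of(
            List.of(new int[] {7, 1, 2, 3}, new int[] {2, 4, 5, 6}),
            List.of(new int[] {2, 7, 8, 9}, new int[] {11, 0, 0, 0}));

        int chunkSize = 5; // hypothetical chunk width
        for (int lo = 0; lo < 15; lo += chunkSize) {
            Map<Integer, List<int[]>> index = buildChunk(files, lo, lo + chunkSize);
            System.out.println("ids [" + lo + ", " + (lo + chunkSize) + "): " + index.keySet());
        }
    }
}
```

Note that every chunk rescans all files, which is exactly the repeated loading cost the question worries about; wider chunks mean fewer passes at the price of more memory per pass.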

To me this sounds fine, but loading 152GB back and forth is time-consuming, and I wonder what the best possible approach is, or even whether Java is the right language to use for such a process.

I have 256GB of RAM and 32 cores on my machine.
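
As a rough back-of-the-envelope check, the figures in the question can be turned into a raw memory estimate. The sketch below assumes each integer occupies 4 bytes and ignores any per-object or serialization overhead, so it gives a lower bound rather than a measurement. The class and method names are hypothetical.

```java
class FootprintSketch {
    static final long FILES = 3_687L;
    static final long RECORDS_PER_FILE = 2_000_000L;
    // Assumption: id + three values, each a 4-byte int, with no overhead.
    static final long BYTES_PER_RECORD = 4L * Integer.BYTES;

    static long totalRecords() {
        return FILES * RECORDS_PER_FILE;
    }

    static long rawBytes() {
        return totalRecords() * BYTES_PER_RECORD;
    }

    public static void main(String[] args) {
        System.out.println("records:   " + totalRecords());
        System.out.println("raw bytes: " + rawBytes());
        // A single Java array is indexed by int, so check whether one int[] could hold it all.
        boolean fitsOneArray = totalRecords() * 4 <= Integer.MAX_VALUE;
        System.out.println("fits in a single int[]: " + fitsOneArray);
    }
}
```

Under these assumptions the data comes to 7,374,000,000 records and about 118GB as packed primitives, which is below the 256GB of RAM mentioned above. It would not fit in a single Java array, though, so something like one primitive array per file would be needed.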


Update:
Let me rephrase this, putting I/O aside and assuming the file is already in memory as a byte array.

What would be the fastest possible way to decode a 42MB object file that holds 2,000,000 records, where each record contains 4 serialized integers?
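
One possible shape for such a decoder is sketched below. It rests on a strong assumption that the question does not confirm: that the bytes are nothing but packed 4-byte big-endian integers, 16 bytes per record. The stated sizes (42MB for 2,000,000 records, about 21 bytes per record) suggest the real files carry extra serialization overhead, in which case this layout would not apply directly. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

class RawDecodeSketch {
    static final int INTS_PER_RECORD = 4;

    /**
     * Decodes packed records into a flat int[] laid out as
     * [id, v1, v2, v3, id, v1, v2, v3, ...].
     * Assumption: big-endian 4-byte ints with no headers or padding.
     */
    static int[] decode(byte[] bytes) {
        if (bytes.length % (INTS_PER_RECORD * Integer.BYTES) != 0) {
            throw new IllegalArgumentException("length is not a whole number of records");
        }
        IntBuffer ints = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).asIntBuffer();
        int[] out = new int[ints.remaining()];
        ints.get(out); // bulk copy, no per-record object allocation
        return out;
    }

    public static void main(String[] args) {
        // Build two sample records in the assumed layout.
        ByteBuffer buf = ByteBuffer.allocate(2 * INTS_PER_RECORD * Integer.BYTES);
        buf.putInt(42).putInt(1).putInt(2).putInt(3);
        buf.putInt(7).putInt(10).putInt(20).putInt(30);

        int[] flat = decode(buf.array());
        for (int i = 0; i < flat.length; i += INTS_PER_RECORD) {
            System.out.println("id=" + flat[i]
                + " values=" + flat[i + 1] + "," + flat[i + 2] + "," + flat[i + 3]);
        }
    }
}
```

Keeping the result as a flat primitive array avoids creating two million small objects per file. Since each file decodes independently, the files could also be spread across the available cores.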

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 9:59 pm

    What I would do is simply load each file and store its records, keyed by id, in some sort of sorted structure: std::map perhaps, or Java's equivalent (TreeMap). Given that it is probably only 10-20 lines of code to read a filename, read the contents of that file into a map, close the file, and move on to the next file, I would probably just write it in C++.
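
For readers staying in Java, the sorted-map idea might look roughly like the sketch below. It is an illustration only: it assumes each file has already been decoded into a flat `int[]` of `[id, v1, v2, v3, ...]`, that an id may occur more than once, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class SortedMapIndexSketch {

    /** Adds every record of one decoded file (flat [id, v1, v2, v3, ...]) to the index. */
    static void addFile(NavigableMap<Integer, List<int[]>> index, int[] flat) {
        for (int i = 0; i + 3 < flat.length; i += 4) {
            int[] values = {flat[i + 1], flat[i + 2], flat[i + 3]};
            index.computeIfAbsent(flat[i], k -> new ArrayList<>()).add(values);
        }
    }

    public static void main(String[] args) {
        NavigableMap<Integer, List<int[]>> index = new TreeMap<>();
        addFile(index, new int[] {9, 1, 1, 1, 3, 2, 2, 2}); // stand-in for file 1
        addFile(index, new int[] {3, 5, 5, 5, 1, 0, 0, 0}); // stand-in for file 2

        System.out.println(index.keySet());                          // [1, 3, 9]
        System.out.println(index.get(3).size());                     // 2
        System.out.println(index.subMap(2, true, 9, true).keySet()); // [3, 9]
    }
}
```

A boxed `TreeMap` carries considerable per-entry overhead, so at the scale in the question it would likely need far more memory than the raw data. Sorting primitive arrays by id is one alternative worth measuring.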

    I don't really see what else you can or should do, unless you actually want to load the data into a DBMS, which I don't think is an unreasonable suggestion at all.

© 2021 The Archive Base. All Rights Reserved