I have a large data set in the following format:
In total, there are 3687 object files, each of which contains 2,000,000 records. Each file is 42MB in size.
Each record contains the following:
- An id (Integer value)
- Value1 (Integer)
- Value2 (Integer)
- Value3 (Integer)
The contents of each file are not sorted or ordered in any way, as the records are stored in the order they are observed during a data collection process.
Ideally, I want to build an index for this data (indexed by the id), which would mean the following:
1. Divide the set of ids into manageable chunks.
2. Scan the files to get the data related to the current working set of ids.
3. Build the index.
4. Go over the next chunk and repeat steps 1, 2 and 3.
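The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: each file is stood in for by an in-memory `int[][]` whose rows are `{id, value1, value2, value3}`, and the names `ChunkedIndexer`, `indexChunk` and `buildAll` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ChunkedIndexer {
    /**
     * Scans every "file" and indexes only the records whose id lies in
     * [loId, hiId). Each file is an assumed in-memory stand-in: rows of
     * {id, value1, value2, value3}.
     */
    static TreeMap<Integer, List<int[]>> indexChunk(List<int[][]> files, int loId, int hiId) {
        TreeMap<Integer, List<int[]>> index = new TreeMap<>();
        for (int[][] file : files) {
            for (int[] rec : file) {
                int id = rec[0];
                if (id >= loId && id < hiId) {
                    index.computeIfAbsent(id, k -> new ArrayList<>()).add(rec);
                }
            }
        }
        return index;
    }

    /**
     * Walks the id range chunk by chunk and returns how many records were
     * indexed in total. A real version would persist each chunk's index
     * before moving on; here it is simply counted and discarded.
     */
    static int buildAll(List<int[][]> files, int minId, int maxIdExclusive, int chunkSize) {
        int indexed = 0;
        // long loop variable so that lo + chunkSize cannot overflow int
        for (long lo = minId; lo < maxIdExclusive; lo += chunkSize) {
            int hi = (int) Math.min(lo + chunkSize, (long) maxIdExclusive);
            TreeMap<Integer, List<int[]>> index = indexChunk(files, (int) lo, hi);
            for (List<int[]> recs : index.values()) {
                indexed += recs.size();
            }
        }
        return indexed;
    }
}
```

Note that this shape rescans every file once per chunk, which is exactly the "loading back and forth" cost described below; fewer, larger chunks mean fewer passes over the data.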
To me this sounds fine, but loading 152GB back and forth is time-consuming, and I wonder about the best possible approach, or even whether Java is actually the right language to use for such a process.
I have 256GB of RAM and 32 cores on my machine.
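A quick back-of-the-envelope check of these figures may help when deciding on chunk sizes. The sketch below only multiplies the numbers given above; it assumes each record is exactly four 4-byte ints with no serialization overhead, which the 42MB file size suggests is not quite true of the on-disk format. The class name `DataSetSize` is hypothetical.

```java
class DataSetSize {
    /** Total number of records across all files. */
    static long totalRecords(long files, long recordsPerFile) {
        return files * recordsPerFile;
    }

    /** Raw payload size in bytes, assuming 4 ints of 4 bytes per record. */
    static long rawBytes(long records) {
        return records * 4 * Integer.BYTES;
    }

    public static void main(String[] args) {
        long records = totalRecords(3687, 2_000_000);
        System.out.println(records);           // 7374000000
        System.out.println(rawBytes(records)); // 117984000000
    }
}
```

Under that assumption the raw payload is about 118GB, which is less than the 256GB of RAM, so the decoded ints could in principle be held in memory at once if stored compactly. The record count exceeds `Integer.MAX_VALUE`, though, so it would not fit in a single Java array.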
Update:
Let me modify this: putting aside I/O, and assuming the file is in memory in a byte array, what would be the fastest possible way to decode a 42MB object file that has 2,000,000 records, where each record contains 4 serialized Integers?
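One possible shape for such a decoder is sketched below. It rests on a strong assumption that the question does not confirm: that the byte array holds nothing but fixed-width records of four big-endian 4-byte ints. If the files were written with Java object serialization, they contain stream headers and per-object metadata and this will not work as-is. The names `RecordDecoder`, `decode` and `idOf` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

class RecordDecoder {
    static final int INTS_PER_RECORD = 4;

    /**
     * Decodes the whole array in one bulk copy into a flat int[].
     * Record i occupies out[4*i] .. out[4*i + 3] as id, value1, value2, value3.
     * ASSUMPTION: raw big-endian ints only, no headers or padding.
     */
    static int[] decode(byte[] data) {
        if (data.length % (INTS_PER_RECORD * Integer.BYTES) != 0) {
            throw new IllegalArgumentException("length is not a multiple of the record size");
        }
        IntBuffer ints = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN).asIntBuffer();
        int[] out = new int[ints.remaining()];
        ints.get(out); // bulk transfer, no per-record object allocation
        return out;
    }

    /** Returns the id of record i in a flat array produced by decode. */
    static int idOf(int[] flat, int i) {
        return flat[i * INTS_PER_RECORD];
    }
}
```

Keeping the result as one flat primitive array avoids allocating 2,000,000 small objects per file, which is usually where the time goes in a naive decoder.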
So, what I would do is just load each file and store the ids in some sort of sorted structure: std::map perhaps, or Java’s equivalent. Given that it’s probably about 10-20 lines of code to read in the filename, read the contents of the file into a map, close the file and ask for the next file, I’d probably just write the C++ to do that.
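For comparison, a rough Java counterpart of that idea could look like the sketch below, using `TreeMap` as the sorted structure. It is an illustration only: it assumes each stream is a plain sequence of big-endian ints (id, value1, value2, value3) readable with `DataInputStream`, which may not match the real file format, and `SortedLoader` and `loadInto` are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class SortedLoader {
    /**
     * Reads records until end of stream and files each one under its id.
     * ASSUMPTION: the stream is raw 4-int records with no headers.
     * A record cut off part-way surfaces as an UncheckedIOException.
     */
    static void loadInto(InputStream raw, TreeMap<Integer, List<int[]>> index) {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(raw))) {
            while (true) {
                int id;
                try {
                    id = in.readInt();
                } catch (EOFException endOfStream) {
                    break; // clean end: no more records
                }
                int[] values = { in.readInt(), in.readInt(), in.readInt() };
                index.computeIfAbsent(id, k -> new ArrayList<>()).add(values);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In use, one would call `loadInto(Files.newInputStream(path), index)` once per file with a shared `index`. With roughly 7.4 billion records, the boxed keys and per-record lists of a `TreeMap` are likely to cost far more memory than the raw data, so this is probably only practical per chunk of ids rather than for the whole data set.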
I don’t really see what else you can or should do, unless you actually want to load it into a DBMS, which I don’t think is an unreasonable suggestion at all.