I have large data in two files each with about two million (different) entries.

Question

Asked: May 25, 2026 (2026-05-25T00:10:17+00:00)

I have large data in two files, each with about two million (different) entries. Each file is organized by event number, and each event contains a number of subevents. Each subevent has some characteristics. The general structure of the files looks like this:

Index  Event     SubEvent      Characteristic1          Characteristic2 .... 
  1      1            1                 322                      234
  2      1            2                 453                      324
  3      1            3                 ...                      ...
  .      .            .                 ...                      ...
  .      .            .                 ...                      ...
 100     1           100                ...                      ...
 101     2            1                 ...                      ...
 102     2            2                 ...                      ...
  .      .            .                 ...                      ...
  .      .            .                 ...                      ...   
  .      .            .                 ...                      ... 
 207     2           107                ...                      ...
 208     3            1                 ...                      ...
 209     3            2                 ...                      ...

and so on; the index runs up to about two million.

I have two files, let's call them file1 and file2, with the above structure. For each subevent of an event, I have to do some computations using the characteristics from both files. Here's the outline of what I have come up with.

LOOP over each INDEX in file1
    LOOP over each INDEX in file2
        if (Event value of file1 is same as Event value of file2)
            /* do some computations with characteristics and store them somewhere */

This is the implementation I have written so far:

for (int i = 0; i < nEntries_1; i++) {
    file1->GetEntry(i);
    for (int j = 0; j < nEntries_2; j++) {
        file2->GetEntry(j);
        if (event1 != event2) break;
        else {
            /* Doing the computation with characteristics */
        }
    }
}

However, I think this is wrong. Suppose we are at index 209 in the outer file1 loop. That means it needs to combine subevent 2 of event 3 in file1 with all the subevents of event 3 in file2. However, the above code would break out of the inner loop immediately, because the event number of the first entry in file2 would not match.

What could be a possible solution? If I just use brute force without the if-break, it takes way too long.

1 Answer

Editorial Team · Answer 1 · 2026-05-25T00:10:17+00:00

In your inner loop you have to use continue to skip a single iteration, rather than break, which aborts the entire loop.
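As a minimal sketch of the difference, assuming the entries have already been read into memory: the Entry struct and the countMatchingPairs function below are hypothetical stand-ins, not part of the original code, and the vectors take the place of the GetEntry calls. This only fixes the correctness problem; every pair of entries is still visited.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one row of a file; only the event number matters here.
struct Entry {
    int event;
    int characteristic;
};

// Counts the (file1 entry, file2 entry) pairs that share an event number.
// `continue` skips a non-matching entry but keeps scanning file2, whereas
// `break` would stop at the first mismatch and miss all later matches.
std::size_t countMatchingPairs(const std::vector<Entry>& file1,
                               const std::vector<Entry>& file2) {
    std::size_t pairs = 0;
    for (const Entry& a : file1) {
        for (const Entry& b : file2) {
            if (a.event != b.event) continue;  // skip this iteration only
            ++pairs;  // placeholder for the real computation
        }
    }
    return pairs;
}
```

With break in place of continue, an entry of event 3 in file1 would never reach the event 3 entries in file2, because the first file2 entry belongs to event 1.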

Design-wise, your algorithm is extremely inefficient, as you can convince yourself by doing a basic complexity analysis. Indexing your data suitably would almost certainly be necessary.
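To put a number on it: with about two million entries in each file, the nested loop performs on the order of 2,000,000 × 2,000,000 = 4 × 10^12 inner iterations. One possible way to index the data is sketched below, assuming the event column of each file has been read into a vector and that entries of the same event are stored contiguously, as in the table in the question. The names Range, buildEventIndex and countPairs are hypothetical.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Half-open range [first, last) of entry indices that share one event number.
using Range = std::pair<std::size_t, std::size_t>;

// One pass over the event column of file2.
// Assumption: entries belonging to the same event are contiguous.
std::unordered_map<int, Range> buildEventIndex(const std::vector<int>& events) {
    std::unordered_map<int, Range> index;
    for (std::size_t i = 0; i < events.size(); ++i) {
        auto it = index.find(events[i]);
        if (it == index.end()) {
            index.emplace(events[i], Range{i, i + 1});  // first entry of this event
        } else {
            it->second.second = i + 1;  // extend the range to include entry i
        }
    }
    return index;
}

// Counts the pairs visited when the inner loop is restricted to the indexed
// range instead of scanning all of file2.
std::size_t countPairs(const std::vector<int>& events1,
                       const std::vector<int>& events2) {
    const auto index = buildEventIndex(events2);
    std::size_t pairs = 0;
    for (int event : events1) {
        const auto it = index.find(event);
        if (it == index.end()) continue;  // event does not occur in file2
        for (std::size_t j = it->second.first; j < it->second.second; ++j) {
            ++pairs;  // placeholder: load entry j of file2 and do the computation
        }
    }
    return pairs;
}
```

The inner loop now only touches entries whose event number matches, so the total work is proportional to the number of matching pairs rather than to the product of the two file sizes.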

This is exactly what databases are for. I recommend you rig up a small database (e.g. MySQL), make two tables and run a JOIN query on the data, which should be a lot more efficient than your manual loop.

Alternatively, if you would like to give it a try yourself, you could build your own micro-database in C++ with a structure like std::multimap and then use equal_range() to do targeted matching.
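A rough sketch of the std::multimap approach could look like the following. The SubEvent struct, its field names, the EventTable alias and the matchEvents function are illustrative assumptions; the sketch also assumes both files fit in memory once loaded.

```cpp
#include <cstddef>
#include <map>

// Hypothetical record for one subevent; the field names are illustrative.
struct SubEvent {
    int subEvent;
    double characteristic1;
    double characteristic2;
};

// Key: event number. A multimap allows many subevents per event.
using EventTable = std::multimap<int, SubEvent>;

// Visits every (file1 subevent, file2 subevent) pair that shares an event
// number and returns how many pairs were visited.
std::size_t matchEvents(const EventTable& file1, const EventTable& file2) {
    std::size_t pairs = 0;
    for (const auto& kv : file1) {
        const SubEvent& a = kv.second;
        // equal_range returns [first, last) over all file2 entries with this key.
        const auto range = file2.equal_range(kv.first);
        for (auto it = range.first; it != range.second; ++it) {
            const SubEvent& b = it->second;
            (void)a;  // placeholder: the real computation would use a and b
            (void)b;
            ++pairs;
        }
    }
    return pairs;
}
```

Each table would be filled once while reading its file, for example with file2.emplace(event, SubEvent{subEvent, c1, c2}). Each equal_range lookup is logarithmic in the table size, so the cost is dominated by the matching pairs rather than by a full scan of file2 for every entry of file1.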
