I have large data sets in two files, each with about two million (different) entries. The structure of each file is such that there is an event number, and for each event, there are some subevents. Each of these subevents has some characteristics. For example, the general structure of the files is as follows:
Index Event SubEvent Characteristic1 Characteristic2 ....
1 1 1 322 234
2 1 2 453 324
3 1 3 ... ...
. . . ... ...
. . . ... ...
100 1 100 ... ...
101 2 1 ... ...
102 2 2 ... ...
. . . ... ...
. . . ... ...
. . . ... ...
207 2 107 ... ...
208 3 1 ... ...
209 3 2 ... ...
and so on; the index runs to about two million.
I have two files, let's call them file1 and file2, with the above structure. I have to make some computations using their characteristics for each subevent of an event. Here's the outline of what I have thought up:
LOOP over each INDEX in file1
    LOOP over each INDEX in file2
        if (Event value of file1 is same as event value of file2)
            /* do some computations with characteristics and store them somewhere */
The current implementation I have written:
for (int i = 0; i < nEntries_1; i++) {
    file1->GetEntry(i);
    for (int j = 0; j < nEntries_2; j++) {
        file2->GetEntry(j);
        if (event1 != event2) break;
        else {
            /* Doing the computation with characteristics */
        }
    }
}
However, I think that this is wrong. Suppose we are at index 209 in the outer file1 loop. This means it needs to compute some characteristic for subevent 2 in event 3 of file1 with all the subevents of event 3 in file2. However, the above code would break out of the inner loop immediately, as the event number of the first entry in file2 would not match.
What could be a possible solution? If I just do a brute-force loop with no if-break, it takes way too long.
In your loop you have to say continue to skip a round, rather than your break, which aborts the entire loop.
Design-wise, your algorithm is extremely inefficient, as you can convince yourself by doing a basic complexity analysis. Indexing your data suitably would almost certainly be necessary.
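To make the difference concrete, here is a minimal sketch of the nested loop with continue in place of break. It assumes the rows have already been read into memory; the Entry struct and the countMatchingPairs function are hypothetical stand-ins, not part of your code or of any library. Counting the matching pairs stands in for the real computation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical in-memory stand-in for one row of a file.
struct Entry {
    int event;     // event number
    int subEvent;  // subevent number within the event
    double value;  // stand-in for one characteristic
};

// Counts the (file1 row, file2 row) pairs that share an event number.
// A real program would do its computation where the counter is incremented.
std::size_t countMatchingPairs(const std::vector<Entry>& file1,
                               const std::vector<Entry>& file2) {
    std::size_t pairs = 0;
    for (const Entry& a : file1) {
        for (const Entry& b : file2) {
            if (a.event != b.event) continue;  // skip this row, keep scanning
            ++pairs;                           // placeholder for the computation
        }
    }
    return pairs;
}
```

This gives correct results, but it still compares every row of file1 with every row of file2. With about two million rows in each file that is roughly 4 × 10^12 comparisons, which is why the brute-force version is so slow.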
This is exactly what databases are for. I recommend you rig up a small database (e.g. MySQL), make two tables and run a JOIN query on the data, which should be a lot more efficient than your manual loop.
Alternatively, if you'd like to give it a try yourself, you could build your own micro-database in C++ with a structure like std::multimap and then use equal_range() to do targeted matching.
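A rough sketch of that idea follows, assuming the rows of both files fit in memory. The SubEvent struct, the Row and EventIndex aliases, and the buildIndex and sumProducts functions are all hypothetical names made up for illustration, and the product of the first characteristics is only a placeholder for whatever computation you actually need.

```cpp
#include <map>
#include <utility>
#include <vector>

// Hypothetical record for one subevent and two of its characteristics.
struct SubEvent {
    int subEvent;
    double c1;
    double c2;
};

// One row of a file: (event number, subevent data).
using Row = std::pair<int, SubEvent>;

// Index keyed by event number; a multimap allows many rows per key.
using EventIndex = std::multimap<int, SubEvent>;

// Builds the index for file2 once, up front.
EventIndex buildIndex(const std::vector<Row>& rows) {
    return EventIndex(rows.begin(), rows.end());
}

// For each row of file1, visits only the file2 rows with the same event.
double sumProducts(const std::vector<Row>& file1, const EventIndex& index2) {
    double total = 0.0;
    for (const Row& row : file1) {
        // equal_range returns the [first, second) range of entries for this key.
        auto range = index2.equal_range(row.first);
        for (auto it = range.first; it != range.second; ++it) {
            total += row.second.c1 * it->second.c1;  // placeholder computation
        }
    }
    return total;
}
```

Each lookup with equal_range() is logarithmic in the size of the index, so the inner loop only touches the subevents that actually belong to the same event instead of scanning all of file2.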