I have a large, strictly increasing array (10 million integers) of offsets for another, larger, data array. No element in data is greater than 50. For example,
unsigned char data[70*1000*1000] = {0,2,1,1,0,2,1,4,2, ...};
unsigned int offsets[10*1000*1000] = {0,1,2,4,6,7,8, ...};
Then I would like to find the count of each element in a series of ranges that are not known until runtime, including only elements whose offsets are included in the offsets array. The endpoints of each range refer to indices of the data array, not to the offsets. For example, the data for the range [1,4] would be:
1 zero
1 one
1 two
The results include only one “one” because, while both data[3] and data[2] are equal to one, 3 is not included in offsets.
I need to compute these binned counts for several hundred ranges, some of which span the entire array. I considered iterating through the data array to store a cumulative sum for each bin and element, but the memory requirements would have been prohibitive. Here is a simple version of my implementation:
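for(int i=0; i<range_count; i++){
    unsigned int j=0;
    // skip offsets that fall before the start of the range
    while(j < 10000000 and offsets[j] < range_starts[i]) j++;
    // count data elements at offsets up to and including the range end
    while(j < 10000000 and offsets[j] <= range_ends[i]) bins[i][data[offsets[j++]]]++;
}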
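For anyone who wants to try this on the small example, here is a minimal, self-contained sketch of this kind of scan, with the range endpoints compared against the values stored in offsets. The function name count_ranges, the Bins alias, and the use of std::vector are assumptions made for illustration; they are not part of my real code, which uses fixed-size arrays.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One counter per possible value 0..50 (hypothetical alias).
using Bins = std::array<unsigned int, 51>;

// Hypothetical helper: linear scan over offsets for every range.
// Range endpoints are inclusive indices into data, and offsets is
// assumed to be strictly increasing.
std::vector<Bins> count_ranges(const std::vector<unsigned char>& data,
                               const std::vector<unsigned int>& offsets,
                               const std::vector<unsigned int>& range_starts,
                               const std::vector<unsigned int>& range_ends) {
    std::vector<Bins> bins(range_starts.size(), Bins{});  // all counts start at zero
    for (std::size_t i = 0; i < range_starts.size(); i++) {
        std::size_t j = 0;
        // Skip offsets that fall before the start of the range.
        while (j < offsets.size() && offsets[j] < range_starts[i]) j++;
        // Count data elements at offsets up to and including the range end.
        while (j < offsets.size() && offsets[j] <= range_ends[i])
            bins[i][data[offsets[j++]]]++;
    }
    return bins;
}
```

Called with the first nine data values and first seven offsets from the example and the single range [1,4], this gives one zero, one one, and one two, matching the counts listed earlier.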
Is there any more efficient way to compute these counts?
While Ruben’s answer did improve the time of the counts by about half, it remained too slow for my application. I include my solution here for the curious.
First, I optimized by setting elements in the data array not indexed by offsets to an unused value (51, for example). This removed the need to track offsets, because I could simply ignore the contents of the bins for value 51 when reporting results.
While I mentioned in the question that storing cumulative counts for each bin and element would require too much memory, I was able to store the cumulative counts for each bin and range endpoint in linear time. Then, for each range, I calculated the occurrences of each element by subtracting the cumulative count for that element at the left endpoint of the range from the count at the right endpoint. Here is what I used: