I am having a problem finding all overlapping ranges in two lists efficiently.
This problem is similar to this question, but with different inputs.
I have two input files: one contains many lines of range and data pairs, and the other contains a list of ranges to find intersections with.
I already wrote a file reader class that reads from the data file, returning objects, one at a time, that hold a list of range and data pairs, but am running into trouble when I try to find the overlaps of the two range lists.
Currently I am brute forcing it, comparing every range in the data list to every range in the intersection list, but because the data file is very large, it is taking a long time.
Sample Objects:
This is the object in the data list:
public class DataModel {
    private int start; // with getter and setter
    private int end;   // with getter and setter
    // Other data
}
The range Model is just a list of paired integers (start, end).
while (fileParser.hasNext()) {
    dataList = fileParser.next();
    for (DataModel data : dataList)
        for (RangeModel range : rangeList)
            if (overlaps(data, range))
                print(range.getString() + " " + data.getString());
}
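The overlaps helper is not shown above. For reference, a minimal sketch of such a check could look like the following. It assumes closed ranges (both endpoints included) and takes plain int endpoints rather than the DataModel and RangeModel objects; the class name OverlapCheck is only illustrative.

```java
final class OverlapCheck {
    // Assumption: ranges are closed, so [1, 10] and [10, 20] overlap at 10.
    // Two ranges overlap exactly when each one starts no later than the other ends.
    static boolean overlaps(int aStart, int aEnd, int bStart, int bEnd) {
        return aStart <= bEnd && bStart <= aEnd;
    }
}
```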
Edit for clarity:
The DataModel objects are delivered in small packets of similar ranges. The packets vary in length, but most are under 20, so the comparison will be run repeatedly against the same RangeModel list for each new packet. The total number of ranges across all the data is around 2 billion, but that doesn't really matter. Thanks for the help.
I can think of different optimizations, but they depend on what kind of data you want available after the check.
Sorting both the data and the ranges by start and processing them in order gives an immediate performance improvement, since there is no point in testing a range that starts at 100 against one that ends at 50.
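As a rough sketch of the sorted approach, something like the following could work. It assumes closed ranges stored as int[]{start, end}; the class and method names (SortedSweep, findOverlaps) are made up for illustration and are not from the question's code. In practice the range list would be sorted once up front, not once per packet.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortedSweep {
    // Returns one int[]{dataStart, dataEnd, rangeStart, rangeEnd} per overlapping pair.
    static List<int[]> findOverlaps(List<int[]> data, List<int[]> ranges) {
        // Sort copies by start so the caller's lists are left untouched.
        List<int[]> sortedData = new ArrayList<>(data);
        List<int[]> sortedRanges = new ArrayList<>(ranges);
        Comparator<int[]> byStart = Comparator.comparingInt(r -> r[0]);
        sortedData.sort(byStart);
        sortedRanges.sort(byStart);

        List<int[]> hits = new ArrayList<>();
        int first = 0; // leading ranges before this index can never overlap again
        for (int[] d : sortedData) {
            // Leading ranges that end before this data range starts also end
            // before every later data range starts, so drop them for good.
            while (first < sortedRanges.size() && sortedRanges.get(first)[1] < d[0]) {
                first++;
            }
            for (int j = first; j < sortedRanges.size(); j++) {
                int[] r = sortedRanges.get(j);
                if (r[0] > d[1]) {
                    break; // every later range starts even further right
                }
                if (r[1] >= d[0]) {
                    hits.add(new int[] {d[0], d[1], r[0], r[1]});
                }
            }
        }
        return hits;
    }
}
```

The early break is what avoids the "starts at 100 versus ends at 50" comparisons. A single very long range near the front of the list keeps the first index from advancing, so the worst case is still quadratic.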
Another improvement would be to 'compress' the ranges. If you have ranges like (1-10), (10-20), (20-30), then you could easily replace them with a single (1-30) range and reduce the number of tests. You can create an appropriate AggregateRange class that keeps track of the identities of its component ranges in case you still want to know which original range is causing the overlap.
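A minimal sketch of such an AggregateRange might look like this. It assumes closed ranges stored as int[]{start, end} and merges a range into the previous aggregate whenever it starts at or before that aggregate's end; the field and method names are only illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AggregateRange {
    int start;
    int end;
    // Original ranges folded into this aggregate, kept so an overlap can
    // still be traced back to the range(s) that caused it.
    final List<int[]> members = new ArrayList<>();

    AggregateRange(int[] first) {
        start = first[0];
        end = first[1];
        members.add(first);
    }

    static List<AggregateRange> merge(List<int[]> ranges) {
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));

        List<AggregateRange> merged = new ArrayList<>();
        for (int[] r : sorted) {
            AggregateRange last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last.end) {
                // Touches or overlaps the current aggregate: extend it.
                last.end = Math.max(last.end, r[1]);
                last.members.add(r);
            } else {
                merged.add(new AggregateRange(r));
            }
        }
        return merged;
    }
}
```

A data range is then tested against the aggregates first, and only on a hit against the members of that aggregate, since overlapping the aggregate does not say which original range was hit.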
Yet another improvement would be to reuse previous results as you process the data list. For example, suppose you test data range (1-10) and it does not overlap anything. If the next data range to test were (2-8), you would not need to test it against the ranges at all, since your previous result guarantees that it will not overlap.
The basic idea behind this improvement is to advance the start of any untested data range up to the end of the last non-overlapping data range. If the new start surpasses the range's own end, then no testing is required, as it cannot overlap.
This means a non-overlapping (1-20) should transform an untested (10-100) into an untested (20-100). This may be trickier to implement, so be careful not to overdo it.
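To make the trimming idea concrete, here is a hedged sketch. The ClearedRegion class and its methods are hypothetical names, ranges are assumed to be closed, and only the most recent non-overlapping data range is remembered. It adds a guard that the text above leaves implicit: trimming is only safe when the new range starts inside the cleared range, which holds automatically if the data is processed in sorted order.

```java
final class ClearedRegion {
    private boolean hasCleared = false;
    private int clearedStart;
    private int clearedEnd;

    // Record that the data range [start, end] was tested and overlapped nothing.
    void markClear(int start, int end) {
        hasCleared = true;
        clearedStart = start;
        clearedEnd = end;
    }

    // Returns the part of [start, end] that still needs testing,
    // or null when the whole range is already known to be clear.
    int[] trim(int start, int end) {
        boolean startsInsideCleared =
                hasCleared && start >= clearedStart && start <= clearedEnd;
        if (!startsInsideCleared) {
            return new int[] {start, end}; // nothing known about this start
        }
        int newStart = clearedEnd; // advance up to the end of the cleared range
        if (newStart > end) {
            return null; // e.g. (2-8) after a clear (1-10)
        }
        return new int[] {newStart, end}; // e.g. (10-100) becomes (20-100)
    }
}
```

If the trimmed range does overlap something, the reported match should still use the original data range, not the trimmed one.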