I am working with some spreadsheet data and I have a set of cell regions with arbitrary bounds. Given any cell, what is the fastest way to determine the subset of regions that contain the cell?
Currently, the best I have is to sort the regions, with the primary sort field being the region’s starting row index, followed by its ending row index, starting column index, and then ending column index. When I want to search for a given cell, I binary search to the first region whose starting row index is after the cell’s row index, and then I check every region before that one to see whether it contains the cell. This is too slow.
Based on some Googling, this is an example of the two dimensional point enclosure searching problem, or the “stabbing problem”. See:
http://www.cs.nthu.edu.tw/~wkhon/ds/ds10/tutorial/tutorial6.pdf
or here (starting at p.21/52):
http://www.cs.brown.edu/courses/cs252/misc/slides/orthsearch.pdf
The key data structure involved is the segment tree:
http://en.wikipedia.org/wiki/Segment_tree
For the 2-D case, it looks like you can build a segment tree containing segment trees and get O(log^2(n)) query complexity. (I think your current solution is O(n), since on average the binary search only excludes about half of your regions and you still scan the rest.)
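To make the "segment tree of segment trees" idea a bit more concrete, here is a minimal Python sketch. It is a simplified variant rather than the textbook structure from the links above: it assumes zero-based integer coordinates on a grid of known size and inclusive region bounds, and it keys both trees on grid positions instead of region endpoints. The class and method names (NestedSegmentTree, add, query) are made up for illustration. Each region is filed under the O(log R * log C) pairs of canonical row and column nodes that cover it, and a query visits the O(log R * log C) node pairs on the cell's two root paths.

```python
from collections import defaultdict


def _next_pow2(n):
    """Smallest power of two that is >= n (and >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def _canonical_nodes(lo, hi, size):
    """Yield the segment-tree nodes that exactly cover the half-open range [lo, hi)."""
    lo += size
    hi += size
    while lo < hi:
        if lo & 1:
            yield lo
            lo += 1
        if hi & 1:
            hi -= 1
            yield hi
        lo >>= 1
        hi >>= 1


def _path_to_root(pos, size):
    """Yield the leaf for position pos and all of its ancestors."""
    node = pos + size
    while node >= 1:
        yield node
        node >>= 1


class NestedSegmentTree:
    """Hypothetical helper: outer tree over rows, inner tree over columns."""

    def __init__(self, n_rows, n_cols):
        self.rsize = _next_pow2(n_rows)
        self.csize = _next_pow2(n_cols)
        # (row_node, col_node) -> ids of regions stored at that node pair
        self.buckets = defaultdict(list)

    def add(self, region_id, r0, r1, c0, c1):
        """Register a region with inclusive bounds rows r0..r1, columns c0..c1."""
        col_nodes = list(_canonical_nodes(c0, c1 + 1, self.csize))
        for rn in _canonical_nodes(r0, r1 + 1, self.rsize):
            for cn in col_nodes:
                self.buckets[(rn, cn)].append(region_id)

    def query(self, row, col):
        """Return the ids of all regions containing the cell (row, col)."""
        hits = []
        col_path = list(_path_to_root(col, self.csize))
        for rn in _path_to_root(row, self.rsize):
            for cn in col_path:
                hits.extend(self.buckets.get((rn, cn), ()))
        return hits


tree = NestedSegmentTree(n_rows=10, n_cols=8)
tree.add("A", 0, 3, 0, 3)
tree.add("B", 2, 5, 1, 6)
print(sorted(tree.query(2, 2)))
```

Because the canonical nodes covering a range are disjoint, each matching region shows up exactly once in a query result, so no de-duplication step is needed.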
However, you said “spreadsheet”, which means you’ve probably got a relatively small area to work with. More importantly, you’ve got integer coordinates. And you said “fastest”, which means you’re probably willing to trade space and setup time for a faster query.
You didn’t say which spreadsheet, but the code below is a wildly inefficient but dirt-simple brute-force Excel/VBA implementation of a 2-D lookup table that, once set up, has O(1) query complexity:
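As a language-neutral illustration of the same lookup-table idea (this is a Python sketch, not the VBA listing referred to above), the table can be a grid of lists with one list per cell. The sketch assumes zero-based indices, inclusive region bounds given as (first_row, last_row, first_col, last_col), and a known grid size. The function name build_lookup_table and the sample regions are hypothetical.

```python
def build_lookup_table(regions, n_rows, n_cols):
    """Precompute, for every cell, the ids of the regions that contain it.

    regions maps a region id to (first_row, last_row, first_col, last_col),
    all inclusive and zero-based. Setup time and space grow with the total
    area of the regions, which is the price paid for constant-time queries.
    """
    table = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    for region_id, (r0, r1, c0, c1) in regions.items():
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                table[r][c].append(region_id)
    return table


# Hypothetical sample data: two overlapping regions and one single cell.
regions = {
    "A": (0, 3, 0, 3),
    "B": (2, 5, 1, 6),
    "C": (9, 9, 7, 7),
}
table = build_lookup_table(regions, n_rows=10, n_cols=8)

# A query is a plain index operation, independent of the number of regions.
print(table[2][2])
print(table[9][7])
```

After the one-off setup, a query is just table[row][col], which is where the O(1) lookup comes from.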
If you have a larger grid to worry about or regions that are large relative to the grid, you can save a ton of space and setup time by using two 1-D lookup tables instead. However, then you have two lookups, plus a need to take the intersection of the two resulting sets.
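Here is a rough Python sketch of the two-table variant, under the same assumptions as before (zero-based indices, inclusive bounds, known grid size). The names build_axis_tables and regions_containing are hypothetical. It relies on the fact that a rectangular region contains a cell exactly when its row range contains the cell's row and its column range contains the cell's column.

```python
def build_axis_tables(regions, n_rows, n_cols):
    """Build one lookup table per axis.

    row_table[r] is the set of region ids whose row range includes r;
    col_table[c] is the set of region ids whose column range includes c.
    Setup cost per region is height + width rather than height * width.
    """
    row_table = [set() for _ in range(n_rows)]
    col_table = [set() for _ in range(n_cols)]
    for region_id, (r0, r1, c0, c1) in regions.items():
        for r in range(r0, r1 + 1):
            row_table[r].add(region_id)
        for c in range(c0, c1 + 1):
            col_table[c].add(region_id)
    return row_table, col_table


def regions_containing(row_table, col_table, row, col):
    """Two lookups plus a set intersection."""
    return row_table[row] & col_table[col]


# Hypothetical sample data.
regions = {
    "A": (0, 3, 0, 3),
    "B": (2, 5, 1, 6),
    "C": (9, 9, 7, 7),
}
row_table, col_table = build_axis_tables(regions, n_rows=10, n_cols=8)
print(sorted(regions_containing(row_table, col_table, 2, 2)))
```

The intersection is no longer strictly constant time: its cost depends on how many regions share the cell's row or column, which is the trade-off against the much smaller tables.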