I have a huge set of data records on disk, arranged in sorted order by some key(s).
The data is read into memory a block (thousands of records) at a time.
I have to search and display all records matching a key.
I was thinking of some binary-search-based algorithm, but I have some restrictions here.
- Within a block, records can only be read sequentially, starting from the beginning of the block.
- Records with the same key can span multiple blocks (as shown in the figure, where key 8 spans Block2 through Block4). In a binary search, if I load the middle block and its first record matches, then I also have to scan the blocks before the matched block.
Can someone help me devise an efficient strategy that could work in C++? Or would it be just as efficient to go with a linear search?
+---+
| 1 | Block1
| 3 |
| 3 |
| 4 |
+---+
| 4 | Block2
| 6 |
| 7 |
| 8 |
+---+
| 8 | Block3
| 8 |
| 8 |
| 8 |
+---+
| 8 | Block4
| 14|
| 15|
| 16|
+---+
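You could build a secondary array that consists of the first entry in each block, then run a binary search on that array. The indices of the array should correspond directly to the block indices, making it an O(1) lookup to get the corresponding block. Because a run of equal keys can begin at the end of one block and continue into the next, start reading from the block just before the first block whose first entry is greater than or equal to the key, then scan forward until you see a larger key.
This cuts the cost of locating the starting block from O(n) to O(log n) in the number of blocks, and it is still relatively simple. Reading the matching records themselves remains a sequential scan.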
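Here is a minimal sketch of that idea, under some assumptions of mine: keys are plain `int`s, every block is non-empty, and the blocks are held in a `std::vector` as a stand-in for whatever routine actually reads a block from disk. The names `Block`, `Hit`, `build_index`, `find_start_block` and `find_all` are made up for illustration. In a real program you would build the index once (for example when the file is written, or in one pass reading the first record of each block) and load only the blocks the scan touches.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumption: one block is just a sorted run of int keys held in memory.
using Block = std::vector<int>;

struct Hit {
    std::size_t block;   // index of the block containing the record
    std::size_t offset;  // position of the record inside that block
};

// Secondary array: the first key of every block (blocks assumed non-empty).
std::vector<int> build_index(const std::vector<Block>& blocks) {
    std::vector<int> first_keys;
    first_keys.reserve(blocks.size());
    for (const Block& b : blocks) first_keys.push_back(b.front());
    return first_keys;
}

// Binary search for the first block whose first key is >= key, then step
// back one block, because the run of matches may begin at the end of the
// previous block.
std::size_t find_start_block(const std::vector<int>& first_keys, int key) {
    auto it = std::lower_bound(first_keys.begin(), first_keys.end(), key);
    if (it == first_keys.begin()) return 0;
    return static_cast<std::size_t>(it - first_keys.begin()) - 1;
}

// Scan forward from the starting block, reading each block from its start,
// and stop at the first record whose key is greater than the target.
std::vector<Hit> find_all(const std::vector<Block>& blocks,
                          const std::vector<int>& first_keys, int key) {
    std::vector<Hit> hits;
    for (std::size_t b = find_start_block(first_keys, key); b < blocks.size(); ++b) {
        for (std::size_t i = 0; i < blocks[b].size(); ++i) {
            if (blocks[b][i] < key) continue;     // still before the run
            if (blocks[b][i] > key) return hits;  // past the run, done
            hits.push_back(Hit{b, i});
        }
    }
    return hits;
}
```

With the four blocks from the question's figure, searching for key 8 starts at Block2 (index 1) and returns six records, ending with the first record of Block4. At most one block is read that contains no match at all, namely the starting block when the run actually begins in the next one.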