Suppose I’m given a large dictionary in a flat file with 200 million words, and my function needs to check the existence of any given word in the dictionary. What’s the fastest way to do it? You can’t store the dictionary in memory because you only have 1 GB of memory. You could store it in a database, but querying it would still be very slow without any optimization. You can’t index the full words because you don’t have enough resources.
Edit: in addition to the file optimization approach mentioned below, are there any database optimizations? I’m thinking of creating partial indices: say, for every two-letter prefix up to some limit, I create an index. Would that speed up the DB query?
Binary search
Assuming the dictionary has the words in alphabetical order, I would attempt a modified binary search. Divide and conquer the file by jumping to the midpoint of the current range and seeing what word is there. If that word is too high (comes after the target), search the lower half; if it’s too low, search the upper half; repeat until the word is found or there’s no file range left to try.
(As outis mentioned in a comment, after jumping to a file location, you’ll need to scan backwards and forwards to find the boundaries of the word you jumped to.)
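A minimal sketch of that file-based binary search in Python, assuming the dictionary is a sorted (plain byte order), one-word-per-line, UTF-8 text file; the function name and signature are just illustrative:

```python
import os

def contains_word(path, target, lo=0, hi=None):
    """Binary-search a sorted, one-word-per-line flat file for `target`.

    Assumes the file is sorted in plain byte order and UTF-8 encoded.
    """
    target = target.encode("utf-8")
    with open(path, "rb") as f:
        if hi is None:
            hi = os.path.getsize(path)
        # Invariant: `lo` is always the byte offset of the start of a line.
        while lo < hi:
            mid = (lo + hi) // 2
            # Scan backwards from `mid` to the start of the line containing it.
            start = mid
            while start > lo:
                f.seek(start - 1)
                if f.read(1) == b"\n":
                    break
                start -= 1
            f.seek(start)
            word = f.readline().rstrip(b"\r\n")
            if word == target:
                return True
            if word < target:
                lo = f.tell()   # keep searching after this line
            else:
                hi = start      # keep searching before this line
    return False
```

A call like `contains_word("dictionary.txt", "example")` only ever touches a handful of short byte ranges; the backwards scan is bounded by the longest word in the file, so each probe is one seek plus a short read.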
You might be able to optimize this by guessing an initial location right off the bat based on the first letter of the word. For example, if the word begins with “c”, start your search around the 3/26 point of the file. Though, in reality, I think this early guess will only make a negligible difference overall.
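If you want to try that, a rough sketch (assuming lowercase a–z words and, unrealistically, an even spread across letters) is to bias only the first probe point and then carry on with the normal binary search bounds:

```python
def first_probe(word, file_size):
    """Pick the first probe offset from the word's first letter ('c' -> ~3/26 of the file).
    The real bounds stay (0, file_size); this only biases the first guess."""
    i = ord(word[0]) - ord("a")
    return file_size * (2 * i + 1) // 52   # middle of that letter's 1/26 slice
```

You would swap this in for the first `mid` computed inside `contains_word`; after that one probe, the regular halving takes over.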
Other optimizations could include keeping a small partial index. For example, keep the file offset of the first word that starts with each letter of the alphabet, or with each possible two-letter prefix. This would allow you to immediately narrow your search range.
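A sketch of how such a small prefix index might be built and combined with the binary search above; the names, the single-pass build, and the two-letter prefix length are all assumptions:

```python
import os

def build_prefix_index(path, prefix_len=2):
    """One pass over the sorted file: record the byte offset of the first
    word for each distinct prefix (at most 26**prefix_len entries)."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            prefix = line.rstrip(b"\r\n")[:prefix_len].decode("utf-8")
            index.setdefault(prefix, offset)
            offset += len(line)
    return index

def search_bounds(word, index, file_size, prefix_len=2):
    """Narrow the binary-search range to the block of words sharing the prefix."""
    prefix = word[:prefix_len]
    if prefix not in index:
        return None                    # no word in the file starts with this prefix
    lo = index[prefix]
    later = [s for s in index.values() if s > lo]
    hi = min(later) if later else file_size
    return lo, hi
```

Building the index costs one full pass over the file, but it only has to happen once, and the result (a few hundred entries at most for two-letter prefixes) fits trivially in memory. You would then call `contains_word(path, word, lo, hi)` with the narrowed bounds.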
Overall, this is O(log n) in the number of words.