I was recently asked a theoretical C question and I was wondering what the best way to approach it would be:
If I had a document with 10 words in it, what would be the best way to determine whether it contained duplicate words and, if it did, how would I keep track of how many there were?
Any insight on how you would approach this would be great.
Theoretical interview questions like this always deal in small numbers (like 10 words). However, the number means nothing; it’s there to separate out those candidates who really can think around the problem in the general form from those who simply regurgitate fixed answers to fixed interview questions they find on the internet.
The best software houses favour solutions that scale. You will therefore gain top marks in an interview if your answer is simple, but also scalable to any size of problem (or, in this case, document). So forget loops inside loops and their O(n^2) complexity, and forget sorting, which costs O(n log n), when a linear-time approach exists. If you presented a solution like these to a leading-edge software company at interview, you would be unlikely to impress.
Your particular question is checking to see if you know about Hash Tables. The most efficient solution to this problem can be written in pseudo-code as follows:
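In C, the idea might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names (`word_table`, `count_word`, `count_words`, `lookup_count`, `print_duplicates`, `free_table`) are made up for this example, the table has a fixed number of buckets, a "word" is taken to be a run of letters compared case-insensitively, and djb2 is just one common choice of string hash.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 1024 /* assumption: fixed size; a real table would grow */
#define MAX_WORD 64      /* assumption: longer words are truncated */

/* One entry per distinct word; entries in a bucket are chained. */
struct entry {
    char *word;
    unsigned long count;
    struct entry *next;
};

struct word_table {
    struct entry *buckets[NUM_BUCKETS];
};

/* Generate the hash key: djb2 string hash reduced to a bucket index. */
size_t hash_word(const char *word)
{
    unsigned long h = 5381;
    for (; *word != '\0'; word++)
        h = h * 33 + (unsigned char)*word;
    return (size_t)(h % NUM_BUCKETS);
}

/* Look the word up, store it if it is new, and bump its count.
   Returns the updated count, or 0 if memory ran out. */
unsigned long count_word(struct word_table *t, const char *word)
{
    size_t i = hash_word(word);
    struct entry *e;
    for (e = t->buckets[i]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return ++e->count;

    e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    return 1;
}

/* Single pass over the text: split on non-letters, lower-case, count. */
void count_words(struct word_table *t, const char *text)
{
    char word[MAX_WORD];
    size_t len = 0;
    assert(t != NULL && text != NULL);
    for (;; text++) {
        unsigned char c = (unsigned char)*text;
        if (isalpha(c)) {
            if (len < MAX_WORD - 1)
                word[len++] = (char)tolower(c);
        } else {
            if (len > 0) { /* end of a word: record it */
                word[len] = '\0';
                count_word(t, word);
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
}

/* Return how often a word was seen (0 if never). */
unsigned long lookup_count(const struct word_table *t, const char *word)
{
    const struct entry *e;
    for (e = t->buckets[hash_word(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->count;
    return 0;
}

/* Read out the table: print each word seen more than once.
   Returns the number of distinct duplicated words. */
size_t print_duplicates(const struct word_table *t)
{
    size_t n = 0;
    for (size_t i = 0; i < NUM_BUCKETS; i++)
        for (const struct entry *e = t->buckets[i]; e != NULL; e = e->next)
            if (e->count > 1) {
                printf("%s: %lu\n", e->word, e->count);
                n++;
            }
    return n;
}

/* Release every entry in the table. */
void free_table(struct word_table *t)
{
    for (size_t i = 0; i < NUM_BUCKETS; i++)
        while (t->buckets[i] != NULL) {
            struct entry *e = t->buckets[i];
            t->buckets[i] = e->next;
            free(e->word);
            free(e);
        }
}
```

To use it, zero-initialise a table (`struct word_table t = {0};`), call `count_words(&t, text)` once on the document's text, then call `print_duplicates(&t)` and finally `free_table(&t)`. Chaining is used for collisions here only because it is short to write; open addressing would serve equally well.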
The most important benefit of the above solution is that only a single scan of the document is required. There is no reading of all the words into memory followed by processing (two scans), no loops in loops (O(n^2) comparisons), and no sorting (O(n log n) comparisons). After exactly one pass over the document, if you read out the keys in the hash table, the count stored with each word tells you exactly how many times that word appeared in the document. Any word with a count greater than one is a duplicate.
The secret to this solution is its use of hash tables. Generation of the hash key (step 2), key lookup (step 3), and key storage (step 5) can be implemented as near constant-time operations on average, provided the hash function spreads words evenly and the table is kept large enough for the number of distinct words. This means the time these steps take hardly changes as the size of the input set (i.e. number of words) grows. Strictly speaking, hashing a word takes time proportional to the length of that word, but not to the number of words already stored. So whether it’s the 10th word in a document or the 10 millionth, inserting that word into the hash table (or looking it up) will take roughly the same very small amount of time. In this case, we additionally keep a count of each word’s frequency in step 5. Incrementing a value is a very cheap fixed-time operation.
Any solution to this problem must scan all words in the document at least once. As our solution processes each word exactly once, with all words taking approximately the same very small constant time to process on average, we say our solution performs optimally and scales linearly, yielding O(n) expected performance (put simply, processing 1,000,000 words will take around 1000 times longer than processing 1000 words). In all, it is a scalable and efficient solution to the problem.