What is a fast and efficient way to implement the server-side component for an autocomplete feature in an HTML input box?
I am writing a service to autocomplete user queries in our web interface’s main search box, and the completions are displayed in an AJAX-powered dropdown. The data we run queries against is simply a large table of concepts our system knows about, which corresponds roughly to the set of Wikipedia page titles. Speed is obviously of utmost importance for this service, since the responsiveness of the web page is critical to the user experience.
The current implementation loads all concepts into memory in a sorted set and performs an O(log n) lookup on each keystroke. The tail set of the closest match is then used to provide additional completions. The problem with this solution is that it does not scale. It is currently running up against the VM heap space limit (I’ve set -Xmx2g, which is about the most we can push on our 32-bit machines), and this prevents us from expanding our concept table or adding more functionality. Switching to 64-bit VMs on machines with more memory isn’t an immediate option.
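For context, the current approach in sketch form (assuming the concepts are plain strings in a `TreeSet`; the real data has more structure, and the class and method names here are my own illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class PrefixLookup {
    // Walk the tail set starting at the prefix and collect entries
    // until one no longer starts with the prefix (or we hit the cap).
    public static List<String> complete(NavigableSet<String> concepts, String prefix, int max) {
        List<String> out = new ArrayList<>();
        for (String s : concepts.tailSet(prefix)) {
            if (!s.startsWith(prefix) || out.size() == max) break;
            out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> concepts = new TreeSet<>(Arrays.asList(
            "Jeff Atwood", "Microsoft", "Microsoft Windows", "StackOverflow.com"));
        System.out.println(complete(concepts, "Micro", 10)); // [Microsoft, Microsoft Windows]
    }
}
```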
I’ve been hesitant to start on a disk-based solution because I’m concerned that disk seek times will kill performance. Are there solutions that would let me scale better, either entirely in memory or with a fast disk-backed implementation?
Edits:
@Gandalf: For our use case it is important that the autocompletion is comprehensive and isn’t just extra help for the user. As for what we are completing, it is a list of concept-type pairs. For example, possible entries are [(“Microsoft”, “Software Company”), (“Jeff Atwood”, “Programmer”), (“StackOverflow.com”, “Website”)]. We are using Lucene for the full search once a user selects an item from the autocomplete list, but I am not yet sure Lucene would work well for the autocomplete itself.
@Glen: No databases are being used here. When I’m talking about a table I just mean the structured representation of my data.
@Jason Day: My original implementation was a trie, but the memory bloat was actually worse than the sorted set because of the large number of object references required. I’ll read up on ternary search trees to see if they could be of use.
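For anyone following along: a ternary search tree stores one character per node with left/middle/right children, so it avoids the per-node child arrays or maps that bloat a standard trie. A minimal sketch (my own illustration, not production code):

```java
import java.util.ArrayList;
import java.util.List;

public class TernarySearchTree {
    private Node root;

    private static class Node {
        final char c;
        Node left, mid, right;   // only three references per node
        boolean isWord;
        Node(char c) { this.c = c; }
    }

    public void insert(String s) { root = insert(root, s, 0); }

    private Node insert(Node n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) n = new Node(c);
        if (c < n.c) n.left = insert(n.left, s, i);
        else if (c > n.c) n.right = insert(n.right, s, i);
        else if (i < s.length() - 1) n.mid = insert(n.mid, s, i + 1);
        else n.isWord = true;
        return n;
    }

    // All stored strings that start with the given prefix, in sorted order.
    public List<String> withPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node n = find(root, prefix, 0);
        if (n == null) return out;
        if (n.isWord) out.add(prefix);
        collect(n.mid, new StringBuilder(prefix), out);
        return out;
    }

    private Node find(Node n, String s, int i) {
        if (n == null) return null;
        char c = s.charAt(i);
        if (c < n.c) return find(n.left, s, i);
        if (c > n.c) return find(n.right, s, i);
        if (i == s.length() - 1) return n;
        return find(n.mid, s, i + 1);
    }

    private void collect(Node n, StringBuilder prefix, List<String> out) {
        if (n == null) return;
        collect(n.left, prefix, out);
        prefix.append(n.c);
        if (n.isWord) out.add(prefix.toString());
        collect(n.mid, prefix, out);
        prefix.deleteCharAt(prefix.length() - 1);
        collect(n.right, prefix, out);
    }

    public static void main(String[] args) {
        TernarySearchTree t = new TernarySearchTree();
        for (String s : new String[]{"Microsoft", "Microscope", "Jeff Atwood"}) t.insert(s);
        System.out.println(t.withPrefix("Micro")); // [Microscope, Microsoft]
    }
}
```

Whether this actually saves memory over the sorted set depends on how much prefix sharing the concept names have.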
With a set that large I would try something like a Lucene index to find the terms you want, and set a timer task that gets reset after every keystroke, with a 0.5-second delay. That way, if a user types multiple characters quickly, you don’t query the index on every keystroke, only when the user pauses. Usability testing will tell you how long that pause should be.
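A minimal sketch of that debounce idea, using `java.util.Timer` (the class name, delay, and query callback here are my own, not from any particular framework):

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

public class AutocompleteDebouncer {
    private final Timer timer = new Timer(true);
    private final long delayMs;
    private final Runnable query;
    private TimerTask pending;

    public AutocompleteDebouncer(long delayMs, Runnable query) {
        this.delayMs = delayMs;
        this.query = query;
    }

    // Called on every keystroke: cancel the pending query and reschedule,
    // so the query only runs once the user has paused for delayMs.
    public synchronized void keystroke() {
        if (pending != null) pending.cancel();
        pending = new TimerTask() {
            @Override public void run() { query.run(); }
        };
        timer.schedule(pending, delayMs);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger queries = new AtomicInteger();
        AutocompleteDebouncer d = new AutocompleteDebouncer(500, queries::incrementAndGet);
        // Simulate a user typing four characters quickly: only one query fires.
        for (int i = 0; i < 4; i++) { d.keystroke(); Thread.sleep(50); }
        Thread.sleep(700);
        System.out.println("queries run: " + queries.get()); // expect 1
    }
}
```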
That’s the rough idea. Also, if the set of query terms is fixed, the Lucene index can be pre-built and optimized ahead of time.