I have an SQL database with about 200,000 words. I need a query that solves a kind of anagram problem. The difference is that I need all the possible words that can be made with the input characters. For example, if you input ofdg, it should output the words do, go, and dog. Can you estimate how long a query like this would take? How can I make it faster and more efficient? Also, in general, how long does it take SQL to parse a 200,000-row database?
To solve this problem, the first thing you need to do is reduce every word to what Scrabble players call an alphagram. That is, all the letters in the word but in alphabetical order. So do, go and dog make do, go and dgo. Of course, any given alphagram may correspond to more than one word, so, for example, alphagram dgo corresponds to both the words dog and god.
The next thing you need to do is construct a table with a key alphagram-sequence number and a single attribute field word.
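The alphagram reduction and the lookup table described above could be sketched roughly as follows. This is only an illustration in Python with the standard sqlite3 module; the table name anagram, the column names alphagram, seq and word, and the helper names alphagram() and build_lookup() are my own assumptions, and any SQL database with a composite primary key would do the same job.

```python
import sqlite3
from collections import defaultdict


def alphagram(word: str) -> str:
    """Return the letters of `word` in alphabetical order."""
    return "".join(sorted(word.lower()))


def build_lookup(conn: sqlite3.Connection, words) -> None:
    """Create and fill the lookup table (done once, ahead of time)."""
    # Hypothetical schema: the key is (alphagram, sequence number),
    # and the word itself is the only other attribute.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS anagram (
            alphagram TEXT    NOT NULL,
            seq       INTEGER NOT NULL,
            word      TEXT    NOT NULL,
            PRIMARY KEY (alphagram, seq)
        )
        """
    )
    next_seq = defaultdict(int)  # last sequence number used per alphagram
    rows = []
    for word in sorted(set(words)):
        key = alphagram(word)
        next_seq[key] += 1
        rows.append((key, next_seq[key], word))
    conn.executemany("INSERT INTO anagram VALUES (?, ?, ?)", rows)
    conn.commit()


# Tiny demonstration with an in-memory database.
conn = sqlite3.connect(":memory:")
build_lookup(conn, ["do", "go", "dog", "god"])
for row in conn.execute("SELECT * FROM anagram ORDER BY alphagram, seq"):
    print(row)
```

The composite primary key gives the database an index on alphagram for free, which is what makes the later lookups cheap.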
Word lists tend to be static. For example, the two Scrabble word lists in the English-speaking world change about every 5 years or so. So you construct this lookup table beforehand. Building it is O(n) in the number of words, and it is a sunk cost: you do it once and store it, so it is not counted against the cost of the query. It makes absolutely no sense to build such an index on the fly every time a query comes in.
You may be wondering “What is all this about Scrabble?” The answer is that your figure of 200,000 words falls neatly between the two approved tournament word lists in the English-speaking world. The US National Scrabble Association’s Official Tournament and Club Word List (2006) contains 178,691 words, and the international list, maintained by the World English Scrabble Players’ Association, contains 246,691.
When you get a query, you reduce the supplied word to a bunch of alphagrams. Input odfg makes alphagrams do, fo, go, df, dg, fg, dfo, dgo, fgo, dfg and dfgo (which is a pretty programming problem in pure SQL, so I have to assume there is a PHP or Python or JavaScript front-end that will do that for you). Then you do the lookup in the database. The cost of each query should be approximately O(log₂ n), in other words pretty damn immediate. That sort of query is what relational databases are good at.
BTW, your example output is poor. Alphagram dfgo with what Scrabble players call ‘build’ (all possible subsets) makes do, od, of, go, dog, god and fog.
(I hate to have to do this rigmarole, but Hasbro’s lawyers are touchy, so: Scrabble is a registered trademark owned in the USA by Hasbro, Inc.; in Canada by Hasbro Canada Corporation; and throughout the rest of the world by J. W. Spear & Sons, a Mattel Company.)
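For what it is worth, the query side described above (expand the input into subset alphagrams in the front-end, then look them all up) might look something like this. It is a sketch in Python with sqlite3, not a definitive implementation: the table anagram and its columns, the helpers subset_alphagrams() and build(), the two-letter minimum and the small sample word list are all assumptions made for the example.

```python
import sqlite3
from itertools import combinations


def subset_alphagrams(letters: str, min_len: int = 2) -> list[str]:
    """Alphagrams of every subset of `letters` with at least `min_len` letters."""
    ordered = sorted(letters.lower())
    found = set()  # a set removes duplicates caused by repeated letters
    for size in range(min_len, len(ordered) + 1):
        # combinations() of a sorted sequence yields sorted tuples,
        # so each joined tuple is already an alphagram.
        for combo in combinations(ordered, size):
            found.add("".join(combo))
    return sorted(found, key=lambda a: (len(a), a))


def build(conn: sqlite3.Connection, letters: str) -> list[str]:
    """Return every stored word that can be made from `letters`."""
    keys = subset_alphagrams(letters)
    if not keys:
        return []
    placeholders = ",".join("?" for _ in keys)
    sql = (
        "SELECT word FROM anagram "
        f"WHERE alphagram IN ({placeholders}) "
        "ORDER BY length(word), word"
    )
    return [row[0] for row in conn.execute(sql, keys)]


# Hypothetical sample data; a real table would be built ahead of time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE anagram (alphagram TEXT, seq INTEGER, word TEXT, "
    "PRIMARY KEY (alphagram, seq))"
)
conn.executemany(
    "INSERT INTO anagram VALUES (?, ?, ?)",
    [
        ("do", 1, "do"), ("do", 2, "od"), ("fo", 1, "of"), ("go", 1, "go"),
        ("dgo", 1, "dog"), ("dgo", 2, "god"), ("fgo", 1, "fog"),
        ("dgi", 1, "dig"),  # cannot be made from "odfg"
    ],
)
print(build(conn, "odfg"))
```

A seven-letter input produces at most 120 subset alphagrams of two or more letters, so a single IN list is fine at that size. Much longer inputs grow exponentially and would eventually run into the database's limit on bound parameters, at which point the keys would need to be looked up in batches.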