I’m am looking for specific suggestions or references to an algorithm and/or data structures for encoding a list of words into what would effectively would turn out to be a spell checking dictionary. The objectives of this scheme would result in a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000 word dictionary. It is not a requirement for the encoded dictionary form to be able to be [easily] converted back into the original word list form – a binary yes/no result is all that is needed for each word tested against the resulting dictionary.
I am assuming the encoding scheme, to improve compression ratio, would take advantage of known structures in a given language such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text ‘words’.
The particular application I have in mind you can assume is for embedded devices where non-volatile storage space is at a premium and the dictionary would be a randomly accessible read-only memory area.
EDIT: To sum up the requirements of the dictionary:
- zero false positives
- zero false negatives
- very high compression ratio
- no need for decompression
See McIlroy’s ‘Development of a Spelling List’ at his pubs page. Classic old paper on spellchecking on a minicomputer, which constraints map surprisingly well onto the ones you listed. Detailed analysis of affix stripping and two different compression methods: Bloom filters and a related scheme Huffman-coding a sparse bitset; I’d go with Bloom filters probably in preference to the method he picked, which squeezes a few more kB out at significant cost in speed. (Programming Pearls has a short chapter about this paper.)
See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the above methods this has no false positives.