Instead of writing your own profiling code, I suggest you…

Question

0

Asked: May 11, 20262026-05-11T00:44:11+00:00 2026-05-11T00:44:11+00:00

I’m am looking for specific suggestions or references to an algorithm and/or data structures

0

I’m am looking for specific suggestions or references to an algorithm and/or data structures for encoding a list of words into what would effectively would turn out to be a spell checking dictionary. The objectives of this scheme would result in a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000 word dictionary. It is not a requirement for the encoded dictionary form to be able to be [easily] converted back into the original word list form – a binary yes/no result is all that is needed for each word tested against the resulting dictionary.

I am assuming the encoding scheme, to improve compression ratio, would take advantage of known structures in a given language such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text ‘words’.

The particular application I have in mind you can assume is for embedded devices where non-volatile storage space is at a premium and the dictionary would be a randomly accessible read-only memory area.

EDIT: To sum up the requirements of the dictionary:

zero false positives
zero false negatives
very high compression ratio
no need for decompression

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

score 0 · Answer 1 · 2026-05-11T00:44:11+00:00

See McIlroy’s ‘Development of a Spelling List’ at his pubs page. Classic old paper on spellchecking on a minicomputer, which constraints map surprisingly well onto the ones you listed. Detailed analysis of affix stripping and two different compression methods: Bloom filters and a related scheme Huffman-coding a sparse bitset; I’d go with Bloom filters probably in preference to the method he picked, which squeezes a few more kB out at significant cost in speed. (Programming Pearls has a short chapter about this paper.)

See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the above methods this has no false positives.

How to approach applying for a job at a company ...

How to handle personal stress caused by utterly incompetent and ...

What is a programmer’s life like?

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions