I have to perform a spell-check-like operation in Python, as follows:
I have a huge list of words (let’s call it the lexicon). I am now given some text (let’s call it the sample). I have to search for each sample word in the lexicon. If I cannot find it, that sample word is an error.
In short – a brute-force spelling checker. However, searching through the lexicon linearly for each sample word is bound to be slow. What’s a better method to do this?
The complicating factor is that neither the sample nor the lexicon is in English. Both are in a language that, instead of 26 characters, can have over 300, stored in Unicode.
Suggestions for any algorithm, data structure, or parallelization method would be helpful. Algorithms that trade some accuracy for speed would be perfect, since I don’t need 100% accuracy. I know about Norvig’s algorithm for this, but it seems English-specific.
You can store the lexicon in a set of Unicode strings and use the `in` operator to check whether a word occurs. This look-up is essentially O(1), so the size of the lexicon does not matter.
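A minimal sketch of the idea in Python 3, where every `str` is already Unicode. The name `lexicon` and the placeholder words are assumptions for illustration only; a real lexicon would be loaded from a file.

```python
# Placeholder words standing in for the real lexicon (hypothetical data).
lexicon = {"नमस्ते", "दुनिया", "शब्द"}

# Membership tests on a set use hashing, so they take roughly constant
# time on average regardless of how many words the set holds.
print("शब्द" in lexicon)   # a word that is in the set
print("शबद" in lexicon)    # a misspelling that is not
```

The alphabet size does not affect the look-up: the set hashes whole strings, so 300 distinct characters cost no more than 26.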
Edit: Here’s the complete code for a (case-sensitive) spell checker (Python 2.6 or above):
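A rough sketch of what such a checker might look like, written for Python 3 rather than 2.6. The function names, the whitespace tokenization, and the NFC normalization step are all assumptions made for illustration; normalization is included only because the same visible character can sometimes be encoded as different code point sequences.

```python
import unicodedata


def normalize(word):
    # Assumption: compose characters (NFC) so that equivalent spellings
    # with different code point sequences compare equal.
    return unicodedata.normalize("NFC", word)


def load_lexicon(lines):
    # Build a set from an iterable of lines, one word per line.
    # With a real file: load_lexicon(open(path, encoding="utf-8"))
    return {normalize(line.strip()) for line in lines if line.strip()}


def find_unknown_words(sample, lexicon):
    # Case-sensitive check; words are split on whitespace only, so
    # punctuation attached to a word is not stripped in this sketch.
    return [w for w in sample.split() if normalize(w) not in lexicon]


if __name__ == "__main__":
    # Tiny in-memory lexicon (hypothetical data) instead of a real file.
    demo_lexicon = load_lexicon(["caf\u00e9\n", "word\n"])
    for word in find_unknown_words("cafe\u0301 cafe word", demo_lexicon):
        print(word)
```

Building the set costs one pass over the lexicon; after that, each sample word costs a single hash look-up.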
(The `print` assumes your terminal uses UTF-8.)