Surprisingly I’ve been unable to find anyone else really doing this, but surely someone


Asked: May 16, 2026 (2026-05-16T08:01:59+00:00)


Surprisingly I’ve been unable to find anyone else really doing this, but surely someone has. I’m currently working on a Python project that involves spell checking some 16 thousand words. Unfortunately, that number is only going to grow. Right now I’m pulling words from Mongo, iterating through them, and spell checking each one with pyenchant. I’ve ruled out Mongo as the bottleneck by fetching all my items from it up front. That still leaves me with around 20 minutes to process 16k words, which is obviously longer than I want to spend. This leaves me with a couple of ideas/questions:

Obviously I could leverage threading or some form of parallelism. Even if I chop this into 4 pieces, I’m still looking at roughly 5 minutes assuming peak performance.
Is there a way to tell which spelling library Enchant is using underneath pyenchant? Enchant’s website seems to imply it’ll use all available spelling libraries/dictionaries when spell checking. If so, then I’m potentially running each word through three or four spelling dictionaries. This could be my issue right here, but I’m having a hard time proving that’s the case. Even if it is, is my only option really to uninstall the other libraries? Sounds unfortunate.

So, any ideas on how I can squeeze at least a bit more performance out of this? I’m fine with chopping this into parallel tasks, but I’d still like to get the core piece of it to be a bit faster before I do.

Edit: Sorry, posting before morning coffee… Enchant generates a list of suggestions for me if a word is incorrectly spelled. That would appear to be where I spend most of my time in this processing portion.


1 Answer

Editorial Team · Answer 1 · 2026-05-16T08:01:59+00:00

I think we agree that the performance bottleneck here is Enchant; for this size of dataset it’s nearly instantaneous to do a boolean isSpeltCorrectly. So, why not:

Build a set in memory of correctly-spelt words, using the dictionaries that Enchant does or fetching your own (e.g. OpenOffice’s).

Optionally, uniquify the document’s words, say by putting them in a set. This probably won’t save you very much.
Check whether each word is in the set or not. This is fast, because it’s just a set lookup: Python sets are hash tables, so a membership test is O(1) on average, regardless of how many words are in the set.
If it isn’t, then ask Enchant to recommend a word for it. This is necessarily slow.

This assumes that most of your words are spelt correctly; if they aren’t, you’ll have to be cleverer.
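
The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the helper names `load_word_set` and `find_suggestions` are made up for this example, the word list is assumed to be a plain one-word-per-line list (Hunspell-style `.dic` files carry affix flags and would need expanding first), and the slow suggester is passed in as a callable so the sketch runs without Enchant installed. With pyenchant you would presumably pass something like `enchant.Dict("en_US").suggest`.

```python
from typing import Callable, Dict, Iterable, List, Set


def load_word_set(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase set of known-good words.

    Assumes plain one-word-per-line input (hypothetical format for this sketch).
    """
    return {line.strip().lower() for line in lines if line.strip()}


def find_suggestions(
    words: Iterable[str],
    known_words: Set[str],
    suggest: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Return {misspelt_word: suggestions}, calling `suggest` only on set misses."""
    results: Dict[str, List[str]] = {}
    for word in set(words):  # uniquify the input words
        if word.lower() in known_words:  # cheap hash lookup
            continue
        results[word] = suggest(word)  # slow path, only for unknown words
    return results


if __name__ == "__main__":
    known = load_word_set(["hello", "world", "spelling"])
    # Stand-in suggester that returns nothing; swap in a real one as needed.
    print(find_suggestions(["hello", "wrold", "hello"], known, lambda w: []))
```

Lowercasing on both sides is a simplification; it would wrongly accept things like a lowercase proper noun, so adjust it if case matters for your data.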
