I have an HTML document, a list of common spelling mistakes, and the correct

Question

Asked: May 13, 2026 (2026-05-13T07:29:14+00:00)

I have an HTML document, a list of common spelling mistakes, and the correct spelling for each case.
The HTML documents will be up to ~50 pages and there are ~30K spelling correction entries.

What is an efficient way to correct all spelling mistakes in this HTML document?

(Note: my implementation will be in Python, in case you know of any relevant libraries.)

I have thought of two possible approaches:

build hashtable of the spelling data
parse text from HTML
split text by whitespace into tokens
if token in spelling hashtable replace with correction
build new HTML document with updated text
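
For illustration, here is a minimal sketch of the steps above, assuming the standard-library `html.parser` is acceptable for the parsing step. The names `CORRECTIONS`, `correct_text`, `CorrectingParser`, and `correct_html` are hypothetical, and the two entries stand in for the real ~30K table. The sketch makes no attempt to skip text inside script or style elements.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample data; the real table would hold ~30K entries.
CORRECTIONS = {"teh": "the", "recieve": "receive"}


def correct_text(text, corrections):
    # Split on whitespace but keep the separators so spacing is preserved.
    parts = re.split(r"(\s+)", text)
    return "".join(corrections.get(part, part) for part in parts)


class CorrectingParser(HTMLParser):
    """Re-emits the document as parsed, correcting only text nodes."""

    def __init__(self, corrections):
        # Keep entities such as &amp; as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.corrections = corrections
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(correct_text(data, self.corrections))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def correct_html(html, corrections):
    parser = CorrectingParser(corrections)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `correct_html("<p>I recieve teh mail</p>", CORRECTIONS)` leaves the tags alone and only rewrites the text between them. A token such as `teh,` is not corrected, because the trailing comma makes it miss the table.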

This approach will fail for multi-word spelling corrections, which will exist. The following approach is simpler, though seemingly less efficient, and will work for multi-word entries:

iterate spelling data
search for word in HTML document
if word exists replace with correction
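
A rough sketch of this second approach, operating on already-extracted text rather than raw HTML. The function name `correct_by_scanning` is hypothetical, and the `\b` word boundaries are an assumption added here so that a short entry does not match inside a longer word.

```python
import re


def correct_by_scanning(text, corrections):
    # One pass over the text per correction entry: simple, but the cost
    # grows with (number of entries) x (length of text).
    for wrong, right in corrections.items():
        if wrong in text:  # cheap substring check before the regex
            pattern = rf"\b{re.escape(wrong)}\b"
            # A function replacement avoids backslash handling in `right`.
            text = re.sub(pattern, lambda m: right, text)
    return text
```

Multi-word entries such as `"should of"` need no special handling here, since each entry is searched for as a whole string.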

1 Answer

Editorial Team · Answer 1 · 2026-05-13T07:29:15+00:00

You are correct that the first approach will be MUCH faster than the second. Additionally, I would recommend looking into tries instead of a straight hash; the space savings will be quite dramatic for 30K words.

To still be able to handle the multi-word cases, you could keep track of the previous token and check your hash for a combined string such as “prev cur”.
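
A minimal sketch of the previous-token idea, assuming the text is already tokenized and that multi-word entries are at most two words long. The name `correct_tokens` is made up for the example.

```python
def correct_tokens(tokens, corrections):
    out = []
    prev = None  # raw previous token; None at the start or after a pair match
    for cur in tokens:
        pair = f"{prev} {cur}" if prev is not None else None
        if pair in corrections:
            # Overwrite what was emitted for `prev` with the pair correction.
            out[-1] = corrections[pair]
            prev = None
        else:
            out.append(corrections.get(cur, cur))
            prev = cur
    return out
```

For example, `correct_tokens("I should of seen alot".split(), {"should of": "should have", "alot": "a lot"})` yields one output item per corrected unit, which can be joined back with spaces. Entries longer than two words would need a longer window of previous tokens.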

Alternatively, you could leave the multi-word corrections out of the hash and combine your two approaches: first use the hash for single words, then do a scan for the multi-word combos (or vice versa). This could still be relatively fast if the number of multi-word corrections is relatively small.
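
One way the combined version might look, as a sketch only. `correct_combined` is a hypothetical name, the table is split on whether the key contains a space, and the simple word pattern is an assumption that only covers ASCII letters and apostrophes.

```python
import re


def correct_combined(text, corrections):
    single = {k: v for k, v in corrections.items() if " " not in k}
    multi = {k: v for k, v in corrections.items() if " " in k}

    # Pass 1: one hash lookup per word found in the text.
    text = re.sub(
        r"[A-Za-z']+",
        lambda m: single.get(m.group(0), m.group(0)),
        text,
    )

    # Pass 2: one scan of the text per multi-word entry (assumed to be few).
    for wrong, right in multi.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", lambda m: right, text)
    return text
```

The cost of the second pass scales with the number of multi-word entries, which is why this only pays off when that number is small.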

Be careful, though: pulling out word tokens is trickier than just splitting on whitespace. You don’t want to fail to correct an error simply because you didn’t find ‘instence,’ with a comma in your hash.
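
To make the punctuation point concrete, here is a small sketch that matches words with a regular expression instead of splitting on whitespace, so trailing commas and brackets never become part of the token. The pattern `WORD_RE` and the function `correct_words` are assumptions for illustration; the pattern covers ASCII letters with internal apostrophes only, and lookups are case-sensitive.

```python
import re

# A word is a run of letters, optionally with internal apostrophes ("don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def correct_words(text, corrections):
    def fix(match):
        word = match.group(0)
        return corrections.get(word, word)

    # Everything that is not a word (spaces, commas, quotes) is left untouched.
    return WORD_RE.sub(fix, text)
```

With this, `correct_words("one instence, here", {"instence": "instance"})` corrects the word and keeps the comma in place.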
