I’m trying to check spelling accuracy of text samples using the Stanford NLP.

Question

Asked: May 13, 2026 (2026-05-13T09:09:53+00:00)

I’m trying to check spelling accuracy of text samples using the Stanford NLP. It’s just a metric of the text, not a filter or anything, so if it’s off by a bit it’s fine, as long as the error is uniform.

My first idea was to check if the word is known by the lexicon:

private static LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");

@Analyze(weight=25, name="Spelling")
public double spelling() {
    int result = 0;

    for (List<? extends HasWord> list : sentences) {
        for (HasWord w : list) {
            // Count every token the parser's lexicon has not seen.
            if (! lp.getLexicon().isKnown(w.word())) {
                System.out.format("misspelled: %s\n", w.word());
                result++;
            }
        }
    }

    // Cast before dividing: int / int would truncate the per-sentence average.
    return (double) result / sentences.size();
}

However, this produces quite a lot of false positives:

misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus
misspelled: Camus
misspelled: foandf
misspelled: foandf
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: Camus
misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus

Any ideas on how to do this better?


1 Answer

Editorial Team · Answer 1 · 2026-05-13T09:09:54+00:00

Using the isKnown(String) method of the parser’s lexicon as a spellchecker isn’t a viable use of the parser. The method itself is correct: “false” means that the word was not seen (with the given capitalization) in the approximately 1 million words of text the parser was trained on. But 1 million words just isn’t enough text to build a comprehensive spellchecker in a data-driven manner. People would typically use at least two orders of magnitude more text, and might well add some cleverness to handle capitalization. The parser includes some of this cleverness to handle words that were unseen in the training data, but this isn’t reflected in what the isKnown(String) method returns.
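As a rough illustration of the alternative, here is a minimal sketch of a word-list lookup that ignores capitalization. It is not part of Stanford NLP: the class name DictionarySpellingMetric and its methods are hypothetical, and the tiny in-memory word list stands in for a large dictionary that you would normally load from a file (for example with Files.readAllLines). Tokens are assumed to be plain strings, such as the values returned by w.word().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Stanford NLP class: checks tokens against a word list.
class DictionarySpellingMetric {
    private final Set<String> dictionary = new HashSet<>();

    // Store every dictionary entry in lower case so lookups ignore capitalization.
    DictionarySpellingMetric(Iterable<String> words) {
        for (String word : words) {
            dictionary.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // A token counts as known if it has no letters (punctuation, numbers)
    // or if its lower-cased form is in the word list.
    boolean isKnown(String token) {
        if (token.chars().noneMatch(Character::isLetter)) {
            return true;
        }
        return dictionary.contains(token.toLowerCase(Locale.ROOT));
    }

    // Average number of unknown tokens per sentence; 0.0 for empty input.
    double unknownPerSentence(List<List<String>> sentences) {
        if (sentences.isEmpty()) {
            return 0.0;
        }
        int unknown = 0;
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                if (!isKnown(token)) {
                    unknown++;
                }
            }
        }
        return (double) unknown / sentences.size();
    }

    public static void main(String[] args) {
        // Stand-in word list; a real one would hold hundreds of thousands of entries.
        DictionarySpellingMetric metric = new DictionarySpellingMetric(
                Arrays.asList("sincerity", "fidelity", "gods", "how", "is"));

        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("Sincerity", "negates", "fidelity", "."),
                Arrays.asList("how", "is", "babby", "formd", "?"));

        // "negates", "babby" and "formd" are missing from the stand-in list,
        // so this prints 1.5 (3 unknown tokens over 2 sentences).
        System.out.println(metric.unknownPerSentence(sentences));
    }
}
```

With a word list this small, a correctly spelled word like "negates" is still flagged, which is the same coverage problem described above. The quality of the metric depends almost entirely on how comprehensive the word list is, and proper nouns such as "Sisyphus" or "Camus" would need their own list or a rule that skips capitalized tokens.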
