I’m interested in ideas for identifying whether any given body of text contains valid,

Question

Asked: June 8, 2026 (2026-06-08T20:17:07+00:00)

I’m interested in ideas for identifying whether any given body of text contains valid, actual words, or just gibberish text.

The problem I run into immediately is that it needs to be language-agnostic, as the data we deal with is highly international. This means either a statistical approach, or an extremely large, multi-lingual hash table approach.

The multi-lingual hash tables seem straightforward, but unwieldy and possibly quite slow. (Or at the very least, a compromise between speed and accuracy.)

However, I don’t really have a background in the statistical approaches that would be useful to me in this situation, and would very much appreciate anyone’s experience or input, or any other suggestions.

1 Answer

Editorial Team · Answer 1 · 2026-06-08T20:17:09+00:00

You could use n-gram analysis to compare your text against a reference text that is known to contain real language. The n-grams can be built from either characters or words.

Google’s Ngram Viewer helps visualize the idea. For example, a search for “haddock refrigerator” returns no occurrences (i.e. it is effectively gibberish), whereas “stack overflow” shows occurrences that rose to prominence once computers did.
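As a rough illustration of the character-level variant, here is a minimal Python sketch. It counts character trigrams in a reference text and scores new text by its average smoothed log-probability under those counts. The function names, the choice of trigrams, the add-one smoothing, and the tiny reference string are all assumptions made for the example, not part of any particular library. For international data you would presumably build the counts from a large corpus covering the languages you expect, and pick a cutoff score empirically.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, with whitespace normalized."""
    # Lowercase, collapse runs of whitespace, and pad with one space on
    # each side so word boundaries become part of the n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(reference_text, n=3):
    """Count n-grams in a reference text assumed to be real language."""
    return Counter(char_ngrams(reference_text, n))


def score(text, counts, n=3):
    """Average log-probability per n-gram, using add-one smoothing.

    Higher (closer to zero) means the text looks more like the reference.
    """
    grams = char_ngrams(text, n)
    if not grams:
        return float("-inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen n-grams
    log_prob = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return log_prob / len(grams)


# Hypothetical toy reference; a real one would be far larger and multilingual.
reference = (
    "the quick brown fox jumps over the lazy dog and the cat sat on the mat "
    "while the rain fell on the roof"
)
model = train(reference)

print(score("the quick brown fox", model))
print(score("xqzv jkkq zzxw", model))
```

Text resembling the reference should score higher than random keystrokes. Character n-grams avoid the need for per-language word lists, which is what makes them a plausible fit for the language-agnostic requirement, though the threshold between "words" and "gibberish" still has to be tuned on your own data.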
