I’m interested in ideas for identifying whether any given body of text contains valid, actual words, or just gibberish text.
The problem I run into immediately is that it needs to be language-agnostic, as the data we deal with is highly international. This means either a statistical approach, or an extremely large, multi-lingual hash table approach.
The multi-lingual hash tables seem straightforward, but unwieldy and possibly quite slow. (Or at the very least, a compromise between speed and accuracy.)
However, I don’t really have a background in the statistical approaches that would be useful to me in this situation, and would very much appreciate anyone’s experience or input, or any other suggestions.
You could use n-gram analysis to compare your text against a reference text that you know is valid. The n-grams can be sequences of either characters or words.
Google’s Ngram Viewer can help visualize what I mean. For example, if I search for “haddock refrigerator” there are no occurrences (i.e. it’s gibberish), whereas “stack overflow” does occur, and became more common once computers did.
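To make the character-level variant concrete, here is a minimal sketch in Python. It counts character trigrams in a reference text and scores new text by its mean log-probability under those counts, with add-one smoothing for unseen trigrams. The function names (`char_ngrams`, `train`, `score`) and the tiny reference string are just for illustration; in practice you would train on a large corpus for each language you care about and pick a cutoff score from your own data.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of lowercased text."""
    text = " ".join(text.lower().split())  # collapse runs of whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(reference_text, n=3):
    """Count n-grams in a reference corpus (hypothetical helper)."""
    counts = Counter(char_ngrams(reference_text, n))
    return counts, sum(counts.values())


def score(text, model, n=3):
    """Mean log-probability per n-gram, using add-one smoothing.

    Higher (less negative) means the text looks more like the reference.
    """
    counts, total = model
    grams = char_ngrams(text, n)
    if not grams:
        return float("-inf")  # too short to judge
    vocab = len(counts) + 1  # one extra bucket for unseen n-grams
    log_probs = (math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return sum(log_probs) / len(grams)


# Toy reference text; an assumption for the example, not a real corpus.
reference = (
    "the quick brown fox jumps over the lazy dog and the cat sat on "
    "the mat while the rain fell over the town"
)
model = train(reference)
print(score("the cat jumps over the dog", model))
print(score("xq zvk jjwq pzx kqv", model))
```

The first score should come out higher than the second, because the real sentence shares trigrams with the reference and the random string does not. Since this works on raw characters, it needs no tokenizer or word list, which is what makes it reasonably language-agnostic; the trade-off is that it needs representative reference text for each language or script, and very short inputs will score unreliably.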