I am working on a somewhat large corpus with articles numbering the tens of thousands. I am currently using PDFBox to extract with various success, and I am looking for a way to programatically check each file to see if the extraction was moderately successful or not. I’m currently thinking of running a spellchecker on each of them, but the language can differ, I am not yet sure which languages I’m dealing with. Natural language detection with scores may also be an idea.
Oh, and any method also has to play nice with Java, be fast and relatively quick to integrate.
Try an automatically learning spell checker. That’s not as scary as it sounds: Start with a big dictionary containing all the words you’re likely to encounter. This can be from several languages.
When scanning a PDF, allow for a certain number of unknown words (say 5%). If any of these words are repeated often enough (say 5 times), add them to the dictionary. If the PDF contains more than 5% unknown words, it’s very likely something that couldn’t be processed.
The scanner will learn over time allowing you to reduce the amount of unknown words if that should be necessary. If that is too much hazzle, a very big dictionary should work well, too.
If you don’t have a dictionary, manually process a couple of documents and have the scanner learn. After a dozen files or so, your new dictionary should be large enough for a reasonable water level.