I am working on retrieving the readable content (i.e. text) from PDF documents, most of which are scientific journal articles. I am using the Poppler text utilities to convert the PDF to text format. The text is extracted nicely, but unfortunately so are other components of the articles (e.g. numerical tables), which cannot be rendered properly in plain text.
For example, I might get the following output in the middle of the article:
Character distributions random Hmax
1 2 3 4
Organization c) (of characters over species
A
B
A 0 0 0 + C
B + + + +
C + + + + A
B 4+
H Character distributions nonrandom Hobs
Entropy
3+ 2+ 1+
(diversity of characters over species
My question is: how would I identify such “noise” and differentiate it from normal blocks of text? Are there any existing algorithms? I am working in Ruby, but code in any language will help.
You could use a Naive Bayes Classifier to model valid vs. non-valid lines.
Here’s an article on one in Ruby; there’s a good implementation in Python’s nltk.
To set it up you would need to give it labeled training examples, for instance by filling one file with good lines and one with bad ones. This is the same model used by spam filters.
One trick for this use case is that many basic Naive Bayes Classifiers work using a word-occurrence model for features, whereas here it’s not the vocabulary that’s significant. You may wish to use line length, percent spaces (rounded to 5% or 10% intervals), or percent of various punctuation marks (rounded but with higher precision). Hopefully your classifier will learn that “lines with no periods and 30% spaces are bad” or “lines with no punctuation where every word begins with a capital letter are bad”.
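To make the feature idea concrete, here is a minimal sketch in Python (the same thing ports to Ruby easily). It is a small hand-rolled Naive Bayes over bucketed line statistics, not the nltk implementation. The names `line_features` and `NaiveBayes`, the bucket sizes, and the tiny training set are all assumptions chosen for illustration. A real setup would train on many more labeled lines.

```python
import math
from collections import Counter, defaultdict

SENTENCE_PUNCT = ".,;:"


def bucket(count, total, step):
    """Integer percentage of count/total, rounded down to a multiple of step."""
    return (100 * count // total) // step * step


def line_features(line):
    """Describe a line by its shape rather than its vocabulary."""
    n = max(len(line), 1)
    words = line.split()
    punct = sum(line.count(c) for c in SENTENCE_PUNCT)
    return {
        "length": min(len(line) // 20, 5),       # 20-character buckets, capped at 5
        "spaces": bucket(line.count(" "), n, 10),  # percent spaces, 10% steps
        "punct": bucket(punct, n, 1),              # percent punctuation, 1% steps
        "all_caps_initial": bool(words) and all(w[0].isupper() for w in words),
    }


class NaiveBayes:
    """Categorical Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values_seen = defaultdict(set)       # feature -> distinct values seen

    def train(self, labeled_lines):
        for line, label in labeled_lines:
            self.label_counts[label] += 1
            for name, value in line_features(line).items():
                self.value_counts[(label, name)][value] += 1
                self.values_seen[name].add(value)

    def classify(self, line):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for name, value in line_features(line).items():
                seen = self.value_counts[(label, name)][value]
                # Smoothing keeps an unseen feature value from zeroing out a label.
                denom = count + len(self.values_seen[name]) + 1
                score += math.log((seen + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: in practice, read these from a "good" file and a "bad" file.
good = [
    "I am using the Poppler text utilities to convert the PDF to text format.",
    "The text is extracted nicely, but unfortunately so are other components.",
    "This is the same model used by spam filters.",
]
bad = [
    "1 2 3 4",
    "A 0 0 0 + C",
    "B + + + +",
    "C + + + + A",
    "3+ 2+ 1+",
    "Character distributions random Hmax",
]

clf = NaiveBayes()
clf.train([(l, "good") for l in good] + [(l, "bad") for l in bad])
```

Calling `clf.classify(line)` on each extracted line then returns `"good"` or `"bad"`. With so few examples the result is only as trustworthy as the training data, so treat the labels as a starting point for tuning the features.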
Based on just your examples above, though, you could probably reject any line with too high a ratio of spaces, or any line completely lacking sentence punctuation such as commas and periods.
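That simpler rule could look something like the sketch below. The function name `looks_like_noise`, the 30% space threshold, and the set of punctuation marks are assumptions to tune against your own documents. Note that a rule this blunt will also reject headings and short sentence fragments that happen to have no punctuation.

```python
def looks_like_noise(line, max_space_ratio=0.3, punctuation=".,;:"):
    """Heuristic: too many spaces, or no sentence punctuation at all."""
    stripped = line.strip()
    if not stripped:
        return False  # blank lines are left for the caller to handle
    space_ratio = stripped.count(" ") / len(stripped)
    has_punctuation = any(ch in punctuation for ch in stripped)
    return space_ratio > max_space_ratio or not has_punctuation


sample = [
    "I am using the Poppler text utilities to convert the PDF to text format.",
    "A 0 0 0 + C",
    "B + + + +",
    "3+ 2+ 1+",
]
# Keep only the lines the heuristic does not flag.
kept = [line for line in sample if not looks_like_noise(line)]
```

Here `kept` ends up holding only the first sentence, since the other three lines either exceed the space ratio or contain no commas or periods.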