What is wrong with this code? I am trying to parse PDF files and extract their text. For some PDFs the extraction works, but for others it throws this error:
Invalid dictionary, found: '' but expected: '/'
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@67fb878
Also, for some PDFs I get no metadata values in the md variable, while for others I do.
Could there be a problem with the byte array?
This is my code:
private BinaryParser binaryParser;

binaryParser.parse(page.getBinaryData());

public void parse(byte[] data) {
    InputStream is = null;
    try {
        is = new ByteArrayInputStream(data);
        text = null;
        Metadata md = new Metadata();
        metaData = new HashMap<String, String>();
        text = tika.parseToString(is, md).trim();
        processMetaData(md);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(is);
    }
}

private void processMetaData(Metadata md) {
    if ((getMetaData() == null) || (!getMetaData().isEmpty())) {
        setMetaData(new HashMap<String, String>());
    }
    for (String name : md.names()) {
        getMetaData().put(name.toLowerCase(), md.get(name));
    }
}
Tika is not perfect. It has problems with many PDF files (unless a lot has changed in the last year). Make sure you are using an up-to-date version of Tika. When I used it about nine months ago it was at version 0.8, and that version had a bug that made PDF parsing particularly problematic. I sidestepped the issue by using PDFBox directly, which is the library Apache Tika wraps. Some of my code wrapping PDFBox is at the end of this post in case you decide to try this route.
If nothing else, using PDFBox directly will give you more control over the parameters. One such parameter controls how “beaded” text is handled. For example, a newspaper with columns is beaded, whereas a letter is not. PDFBox can attempt to maintain the flow of writing across beads, but it doesn’t always do a great job. If you are only extracting text from non-beaded PDFs, you might want to disable this feature.
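To make the bead setting concrete, here is a minimal sketch of calling PDFBox directly on the same kind of byte array the question uses. It assumes PDFBox 2.x (in 1.x, PDFTextStripper lives in org.apache.pdfbox.util, and 3.x loads documents through a different class). The class name PdfBoxExtractor and the method extractText are made up for illustration; this is not the wrapper code mentioned above.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper, assuming PDFBox 2.x is on the classpath.
class PdfBoxExtractor {

    // Extracts the text of a PDF held in memory.
    // separateByBeads = false turns off PDFBox's attempt to follow
    // article beads (e.g. newspaper columns).
    static String extractText(byte[] data, boolean separateByBeads) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(data)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setShouldSeparateByBeads(separateByBeads);
            return stripper.getText(document);
        }
    }
}
```

With the code from the question, the call would look something like extractText(page.getBinaryData(), false). Letting the IOException propagate, instead of printing the stack trace and carrying on, also makes it easier to see which files fail.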
You may also want to try the pdftotext program. Once again, make sure you have the latest version. With all PDF-to-text converters, results can change considerably from one version to the next!
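If you want to call pdftotext from Java, a rough sketch could look like the following. It assumes the pdftotext binary is installed and on the PATH, and that you are on Java 9 or later (for readAllBytes). The class PdfToTextRunner and its methods are hypothetical names. The -layout and -enc options, and "-" as the output file for stdout, are standard pdftotext arguments, but check the man page of the version you have installed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the external pdftotext command.
class PdfToTextRunner {

    // Builds the command line; kept separate so it can be inspected or logged.
    static List<String> buildCommand(Path pdf, boolean keepLayout) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftotext");
        if (keepLayout) {
            cmd.add("-layout");      // try to preserve the physical layout
        }
        cmd.add("-enc");
        cmd.add("UTF-8");            // output encoding
        cmd.add(pdf.toString());     // input file
        cmd.add("-");                // "-" sends the text to stdout
        return cmd;
    }

    // Runs pdftotext and returns whatever it wrote to stdout.
    static String extract(Path pdf, boolean keepLayout)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, keepLayout))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show pdftotext errors on our stderr
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }
}
```

Since the question has the PDF as a byte array, you would first need to write it to a temporary file (for example with Files.write) and pass that path to extract.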