I’m configuring Jackrabbit 2.3.6 and I need to index binary files (PDF,
ODT), so I configured SearchIndex in repository.xml according to
http://wiki.apache.org/jackrabbit/Search. But when I insert a file into the repository and run a full-text
search, no results are returned.
Then I noticed this warning in the logs:
SearchIndex.java:2087 The textFilterClasses configuration parameter has
been deprecated, and the configured value will be ignored: org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
How should I configure SearchIndex to index binary data? This is what I
have now; it uses the deprecated parameter, which is ignored according to the warning above:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${rep.home}/repository/index"/>
  <param name="textFilterClasses" value="org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor"/>
  <param name="supportHighlighting" value="true"/>
</SearchIndex>
Thanks for replies.
This is Mark Herman’s answer to a similar question on the Jackrabbit Users mailing list:
I’m not an expert, but what I do know is that JR uses Tika to extract text, and
it determines how based on the jcr:mimeType property. If you don’t supply a
mimetype, then it won’t know how to extract it (although I wouldn’t
recommend that as a practice). I believe there is a way to supply JR with a
Tika config that might give you what you want. EDIT: There isn’t. It’s hardcoded.
Additionally, you can specify an indexing config in the repository/workspace
xml files, where you can set some rules on what gets indexed and how by
Lucene.
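
A minimal sketch of what that could look like, assuming Jackrabbit 2.x. The deprecated textFilterClasses parameter is dropped, since the warning says its value is ignored anyway, and an indexingConfiguration parameter points at a rules file. The file name and location (${rep.home}/indexing_configuration.xml) are hypothetical, chosen only for illustration:

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${rep.home}/repository/index"/>
  <param name="supportHighlighting" value="true"/>
  <!-- Hypothetical location of the indexing rules file -->
  <param name="indexingConfiguration" value="${rep.home}/indexing_configuration.xml"/>
</SearchIndex>
```

The rules file itself might then look roughly like this. The rule shown is an assumption for illustration, not taken from the posts above; an index-rule limits indexing for the matching node type to the listed properties, so it is only needed if you want to narrow what gets indexed:

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <!-- Illustrative rule: for nt:resource nodes, index only the jcr:data property -->
  <index-rule nodeType="nt:resource">
    <property>jcr:data</property>
  </index-rule>
</configuration>
```

Either way, as the answer says, text extraction from the binary depends on jcr:mimeType being set on the node that holds jcr:data (for example application/pdf for a PDF file).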