We have the following Solr (3.4) schema for indexing html/text documents:
<fields>
<field name="text" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="title" type="text" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="created" type="date" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="modified" type="date" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="filesize" type="integer" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="mimetype" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="id" type="string" indexed="true"
stored="true" required="true" multiValued="false"
omitNorms="false"/>
<field name="tag" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<field name="relpath" type="string" indexed="true"
stored="true" required="false" multiValued="false"
omitNorms="false"/>
<dynamicField name="tika_*" type="ignored" />
</fields>
The configurations are auto-generated from templates from the solrinstance recipe for zc.buildout.
Now we need to import/index PDF/Office files etc. into Solr for fulltext indexing.
The generated requestHandler for the extraction is:
<requestHandler name="/update/extract"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.text">tika_content</str>
<str name="lowernames">false</str>
<str name="uprefix">tika_</str>
</lst>
</requestHandler>
But after uploading a PDF file through curl, I cannot find any indication that it
has been indexed (no changes in the document stats etc.).
What is the trick here?
[Update]
I am using
curl "http://localhost:8983/solr/update/extract?literal.id=2&commit=true&fmap.content=text" -F "myfile=@1.pdf"
to upload a PDF file. Adding fmap.content=text seems to do the desired mapping (overriding the generated configuration).
This seems to have solved the problem.
fmap is basically a field mapping for the content generated by Tika.
The Tika handler extracts the content of the uploaded document and assigns it to a field named
content. <str name="fmap.content">text</str> maps that content field to the text field defined in the schema. As you have
a text field defined in the schema, this will work. However, for
<str name="fmap.text">tika_content</str> there is no tika_content field defined (apart from the tika_* dynamic field of type ignored), nor, I think, does a field named text get generated by Tika, so it would not result in any matches.
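To make the mapping permanent rather than passing fmap.content=text on every request, the handler defaults could look roughly like the sketch below. It reuses the handler from the question and only swaps the fmap line. This assumes the generated solrconfig.xml can be changed this way; since the file is produced from the buildout recipe's templates, the change would probably have to go into the recipe or its template, which is not covered here.

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map the body extracted by Tika ("content") onto the schema's "text" field -->
    <str name="fmap.content">text</str>
    <str name="lowernames">false</str>
    <!-- Extracted fields that are not in the schema get this prefix
         and so match the tika_* dynamic field of type "ignored" -->
    <str name="uprefix">tika_</str>
  </lst>
</requestHandler>
```

With these defaults the upload only needs literal.id and commit=true in the URL.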