Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip,

Question

Asked: May 20, 2026

Can you use ExtractingRequestHandler and Tika with any of
the compressed file formats (zip, tar, gz, etc) to extract the content out for indexing?

I am sending Solr the archived.tar file using curl:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"

The result I get when I query the document is that the file names inside the
archive are indexed as the “body_texts”, but the content of those files is
not extracted or included. This is not the behavior I expected. Ref:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example.
When I send one of the actual documents inside the archive using the same curl
command the extracted content is then stored in the “body_texts” field. Am
I missing a step for the compressed files?
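
For anyone reproducing this, the check can be scripted. The following is only a sketch: make_probe_archive, extract_url, and the "probe" file names are hypothetical helpers, while the URL reuses the handler and parameters from the curl command above. It builds a tar archive whose single member contains a known marker word, so a later search for that word shows whether the member's content (and not just its file name) was indexed.

```shell
#!/bin/sh
# Sketch: build a small tar archive with known content for testing extraction.
# make_probe_archive, extract_url and the "probe" names are hypothetical.

make_probe_archive() {
    work_dir="$1"   # scratch directory (assumed writable)
    token="$2"      # a distinctive word to search for after indexing
    mkdir -p "$work_dir/probe" || return 1
    printf 'This file contains the marker %s\n' "$token" > "$work_dir/probe/inner.txt"
    # -C keeps the member paths relative to work_dir
    tar -cf "$work_dir/probe.tar" -C "$work_dir" probe || return 1
    echo "$work_dir/probe.tar"
}

extract_url() {
    # Same handler and parameters as the command in the question; only the id varies.
    echo "http://localhost:8983/solr/update/extract?literal.id=$1&fmap.content=body_texts&commit=true"
}
```

Possible usage, assuming Solr is running locally and body_texts is a searchable field in your schema: run archive=$(make_probe_archive /tmp/solr-probe zebrafinch), post it with curl "$(extract_url probe1)" -H 'Content-type:application/octet-stream' --data-binary "@$archive", then query http://localhost:8983/solr/select?q=body_texts:zebrafinch. A hit suggests the archive's contents were extracted; no hit matches the behavior described above.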

I have added all the extraction dependencies as indicated by mat in
http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
am able to successfully extract data from MS Word, PDF, HTML documents.

I’m using the following library versions.
Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4

Given everything I have read this version of Tika should support extracting
data from all files within a compressed file. Any help or suggestions would
be appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-05-20T13:46:05+00:00

The short answer: use Solr Cell 1.4.1 with Tika Core 0.6.

The long answer: after a lot of headaches I was able to get this working. I’ll answer for both people using Solr directly and people using Solr through the Ruby library Sunspot (which was my case).

Here is what I did: I used the https://github.com/tomasc/sunspot_cell plugin to extend Sunspot and give it the attachment feature. (Ignore this step if you’re not using Ruby/Sunspot.)

v1.4.1 works for individual files but not for compressed files, so I had to explore a bit. I downloaded the v1.4.1 codebase from http://lucene.apache.org/solr/ and grabbed dist/apache-solr-cell-1.4.1.jar. Then I pulled down the Tika libraries from the 1.5 branch: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/

You can download each library individually, or use svn to check out the whole branch:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev

Or just check out the library folder:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/
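
Once the lib folder is checked out, the jars still have to end up somewhere Solr loads them from. A minimal sketch of that copy step follows. Note the assumptions: install_extraction_jars is a hypothetical helper, and both directories are placeholders you would replace with your own checkout path and whichever lib directory your Solr instance actually reads jars from.

```shell
#!/bin/sh
# Sketch: copy the checked-out extraction jars into a Solr lib directory.
# install_extraction_jars is a hypothetical helper; both paths are assumptions.

install_extraction_jars() {
    src_dir="$1"    # e.g. the svn checkout of contrib/extraction/lib
    dest_dir="$2"   # e.g. a lib directory your Solr instance loads jars from

    if [ ! -d "$src_dir" ]; then
        echo "missing source dir: $src_dir" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1

    copied=0
    for jar in "$src_dir"/*.jar; do
        # Skip the unexpanded pattern when the directory holds no jars.
        [ -e "$jar" ] || continue
        cp "$jar" "$dest_dir/" || return 1
        copied=$((copied + 1))
    done
    echo "copied $copied jar(s) into $dest_dir"
}
```

For example, install_extraction_jars ./lib /path/to/solr/lib, followed by a Solr restart so the new jars are picked up. If older Tika jars are already in the destination, you would probably want to remove them first so two versions do not sit on the classpath together.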
