The Archive Base

Editorial Team
Asked: May 20, 2026

Can you use ExtractingRequestHandler and Tika with any of
the compressed file formats (zip, tar, gz, etc.) to extract the content for indexing?

I am sending Solr the archived.tar file using curl:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" \
      -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"

When I query the document, the file names inside the archive are indexed as
"body_texts", but the content of those files is not extracted or included.
This is not the behavior I expected. Ref:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example.
When I send one of the actual documents inside the archive using the same curl
command, the extracted content is stored in the "body_texts" field. Am
I missing a step for the compressed files?
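
One way to inspect what Tika returns for the archive, without indexing anything, is the extract handler's extractOnly parameter. The following is a minimal sketch, assuming the same local Solr URL as in the question; the helper function names are made up for illustration and are not part of Solr.

```shell
#!/bin/sh
# Sketch: ask the extract handler to return what Tika extracted instead of
# indexing it. SOLR_URL defaults to the local instance used in the question.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print the extract-only URL (hypothetical helper, not part of Solr).
extract_only_url() {
    printf '%s/update/extract?extractOnly=true\n' "$SOLR_URL"
}

# Post a file and print Solr's response, which carries the extracted text.
# The file path is passed as the first argument.
extract_only() {
    curl "$(extract_only_url)" \
        -H 'Content-type:application/octet-stream' \
        --data-binary "@$1"
}

# Usage (not run here): extract_only /home/archived.tar
```

If the response for the archive lists only file names while the response for a single document contains its text, the limitation is in the extraction step rather than in the field mapping.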

I have added all the extraction dependencies as indicated by mat in
http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
am able to successfully extract data from MS Word, PDF, and HTML documents.

I'm using the following library versions:
Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4

From everything I have read, this version of Tika should support extracting
data from all files within a compressed file. Any help or suggestions would
be appreciated.


1 Answer

  1. Editorial Team
    Added an answer on May 20, 2026 at 1:46 pm

    The short answer: use Solr Cell 1.4.1 with Tika Core 0.6.

    The long answer: After a lot of headaches I was able to get this working. I'll answer it both for people using Solr directly and for people using Solr with the Ruby library Sunspot (which was my problem).

    Here is what I did: I used the https://github.com/tomasc/sunspot_cell plugin to extend Sunspot and give it the attachment feature. (Ignore this step if you're not using Ruby/Sunspot.)

    v1.4.1 works for individual files but not for compressed files, so I had to explore a bit. I downloaded the v1.4.1 codebase from http://lucene.apache.org/solr/ and grabbed dist/apache-solr-cell-1.4.1.jar. Then I pulled down the Tika libraries from the 1.5 branch: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/.

    You can download each one individually, or you can use svn to check out the whole branch:

    svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev
    

    Or just check out the library folder:

    svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/
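
The steps above stop at downloading the jars. Presumably they then have to replace the older Tika jars wherever your Solr instance loads its extraction libraries from. Here is a rough sketch of that swap; both directory arguments and the function name are assumptions for illustration, not paths taken from Solr or Sunspot.

```shell
#!/bin/sh
# Sketch: replace the Tika jars that Solr loads with the ones checked out above.
# Both directories are hypothetical; pass the directory your Solr instance
# loads extraction jars from and the directory holding the checked-out lib/.
swap_tika_jars() {
    solr_lib=$1                             # where Solr loads jars from (assumption)
    new_libs=$2                             # checkout of contrib/extraction/lib
    backup="${solr_lib%/}.tika-backup"      # sibling directory for the old jars

    mkdir -p "$backup"
    # Move the old Tika jars aside so two versions are not loaded together.
    for jar in "$solr_lib"/tika-*.jar; do
        if [ -e "$jar" ]; then
            mv "$jar" "$backup/"
        fi
    done
    # Copy in the newer extraction libraries.
    cp "$new_libs"/*.jar "$solr_lib"/
}

# Usage (not run here): swap_tika_jars /path/to/solr/lib ./lib
```

Only jars named tika-*.jar are moved aside. Older copies of the parser libraries that Tika depends on may also need removing by hand, and Solr has to be restarted to pick up the change.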
    
