Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8320707
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 8, 20262026-06-08T22:44:59+00:00 2026-06-08T22:44:59+00:00

I integrated Tika with Solr following the instructions provided in this link Correct me

  • 0

I integrated Tika with Solr following the instructions provided in this link

Correct me if I am wrong, it seems to me that it can index the document files(pdf,doc,audio) located on my own system (given the path of directory in which those files are stored), but cannot index those files, located on internet, when I crawl some sites using nutch.

Can I index the documents files(pdf,audio,doc,zip) located on the web using Tika?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-08T22:45:01+00:00Added an answer on June 8, 2026 at 10:45 pm

    There are basically two ways to index binary documents within Solr, both with Tika:

    1. Using Tika on the client side to extract information from binary files and then manually indexing the extracted text within Solr
    2. Using ExtractingRequestHandler through which you can upload the binary file to the Solr server so that Solr can do the work for you. This way tika is not required on the client side.

    In both cases you need to have the binary documents on the client side. While crawling, nutch should be able to download binary files, use Tika to generate text content out of them and then index data in Solr as it’d normally do with text documents. Nutch already uses Tika, I guess it’s just a matter of configuring the type of documents you want to index changing the regex-urlfilter.txt nutch config file by removing from the following lines the file extensions that you want to index.

    # skip some suffixes
    -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
    

    This way you would use the first option I mentioned. Then you need to enable the Tika plugin on nutch within your nutch-site.xml, have a look at this discussion from the nutch mailing list.

    This should theoretically work, let me know if it doesn’t.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am using Tika integrated in SOLR to index documents and allow search on
VIM seems integrated to the terminal. Can I open a remote file from the
I integrated this feature in my favoriate language OCaml, I know that this is
Language Integrated Query. Now I know that the acronyms are. I have seen C#
We integrated a wysiwyg editor in our website. Now we got the problem that
I'm using JavaFX integrated HTMLEditor. All the functions that it has are fine but
The integrated Web Deployment in Visual Studio 2010 is pretty nice. It can create
So I've integrated the Paypal in my payment flow and this is what happens:
I've integrated the Google +1 button into my site ( link ), but for
Has anyone integrated with Google Latitude (or any Google API, for that matter) with

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.