I integrated Tika with Solr following the instructions provided in this link Correct me

Asked: June 8, 2026

I integrated Tika with Solr following the instructions provided in this link

Correct me if I am wrong, but it seems to me that this setup can index document files (PDF, DOC, audio) located on my own system (given the path of the directory in which those files are stored), but cannot index files located on the internet when I crawl sites using Nutch.

Can I index document files (PDF, audio, DOC, ZIP) located on the web using Tika?

1 Answer

Editorial Team · Answer 1 · 2026-06-08T22:45:01+00:00

There are basically two ways to index binary documents within Solr, both with Tika:

Using Tika on the client side to extract information from binary files, and then manually indexing the extracted text in Solr.
Using ExtractingRequestHandler, through which you upload the binary file to the Solr server so that Solr does the extraction for you. This way Tika is not required on the client side.
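
As a rough illustration of the second option, here is a minimal sketch of posting a binary file to Solr's ExtractingRequestHandler. It assumes the handler is mapped at `/update/extract` and that your Solr URL includes a core name; the base URL, the core name `docs`, and the helper names `extract_url` and `post_file` are made up for this example, so adjust them to your own setup.

```python
from urllib.parse import urlencode
import urllib.request


def extract_url(solr_base, core, doc_id, commit=True):
    """Build the URL used to send one binary file to /update/extract.

    solr_base and core are assumptions about your installation;
    literal.id sets the id field of the resulting Solr document.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{solr_base.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


def post_file(url, path, content_type="application/octet-stream"):
    """POST the raw bytes of a local file; Solr runs Tika on the server side."""
    with open(path, "rb") as fh:
        data = fh.read()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

For example, `post_file(extract_url("http://localhost:8983/solr", "docs", "doc1"), "report.pdf", "application/pdf")` would upload a local PDF, provided a Solr instance is actually running at that address.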

In both cases you need to have the binary documents on the client side. While crawling, Nutch should be able to download binary files, use Tika to extract text content from them, and then index the data in Solr as it would normally do with text documents. Nutch already uses Tika, so I guess it is just a matter of configuring which types of documents you want to index: edit the regex-urlfilter.txt Nutch config file and remove, from the following lines, the file extensions that you want to index.

# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
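
If you would rather not edit that long line by hand, the following is a small sketch of a helper that removes chosen extensions from the skip rule. The function name `allow_suffixes` is hypothetical, and it assumes the rule has the same shape as the line above: a single parenthesised group of alternatives separated by `|`.

```python
def allow_suffixes(rule, allowed):
    """Return the skip rule with the given extensions removed.

    Matching is case-insensitive, so allowing "pdf" drops both
    "pdf" and "PDF" from the list of skipped suffixes.
    """
    prefix, rest = rule.split("(", 1)      # text before the group, e.g. "-\."
    body, suffix = rest.rsplit(")", 1)     # alternatives, and the trailing "$"
    allowed_lower = {ext.lower() for ext in allowed}
    kept = [ext for ext in body.split("|") if ext.lower() not in allowed_lower]
    return f"{prefix}({'|'.join(kept)}){suffix}"


# Example: stop skipping PDF, DOC and ZIP files.
rule = r"-\.(swf|SWF|doc|DOC|pdf|PDF|zip)$"
new_rule = allow_suffixes(rule, ["pdf", "doc", "zip"])
```

You would then paste the resulting line back into regex-urlfilter.txt in place of the original rule.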

This way you would be using the first option I mentioned. You then need to enable the Tika plugin in Nutch through your nutch-site.xml; have a look at this discussion from the Nutch mailing list.
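
To check whether the Tika parser is enabled, here is a hedged sketch that reads the `plugin.includes` property from a nutch-site.xml string and tests whether the plugin id `parse-tika` matches it. It assumes that property is where plugins are switched on and that its value is treated as a regular expression over plugin ids; the sample value below is only illustrative, not a recommended configuration, and `plugin_enabled` is a name invented for this example.

```python
import re
import xml.etree.ElementTree as ET


def plugin_enabled(nutch_site_xml, plugin_id="parse-tika"):
    """Return True if plugin.includes in the given XML matches plugin_id.

    Returns False when the property is absent from this file; in that case
    Nutch would fall back to its defaults, which this sketch does not read.
    """
    root = ET.fromstring(nutch_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            pattern = prop.findtext("value", "").strip()
            return re.fullmatch(pattern, plugin_id) is not None
    return False


# Illustrative configuration (assumed values, adapt to your own install).
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
  </property>
</configuration>"""

tika_on = plugin_enabled(SAMPLE)
```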

This should theoretically work; let me know if it doesn’t.
