I integrated Tika with Solr following the instructions provided in this link
Correct me if I am wrong, it seems to me that it can index the document files(pdf,doc,audio) located on my own system (given the path of directory in which those files are stored), but cannot index those files, located on internet, when I crawl some sites using nutch.
Can I index the documents files(pdf,audio,doc,zip) located on the web using Tika?
There are basically two ways to index binary documents within Solr, both with Tika:
In both cases you need to have the binary documents on the client side. While crawling, nutch should be able to download binary files, use Tika to generate text content out of them and then index data in Solr as it’d normally do with text documents. Nutch already uses Tika, I guess it’s just a matter of configuring the type of documents you want to index changing the regex-urlfilter.txt nutch config file by removing from the following lines the file extensions that you want to index.
This way you would use the first option I mentioned. Then you need to enable the Tika plugin on nutch within your nutch-site.xml, have a look at this discussion from the nutch mailing list.
This should theoretically work, let me know if it doesn’t.