I’ve just set up Solr, indexed some pages (crawled using Nutch) and I can now search.
I now need to change it to index sentences instead of web pages. The result I need is, for example, to do a search for “one word” and get a list of all sentences that contain “one” and/or “word”.
I’m new to Solr so any pointers to where I should start from to achieve this would be extremely helpful. Is it at all possible? Or is there an easy way of doing this I’ve missed?
Yes. Solr indexes ‘documents’. You define what a document is by what you post to its RESTful update endpoint. If you push one sentence at a time, it indexes one sentence at a time.
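To make that concrete, here is a minimal sketch of posting one document per sentence to Solr's JSON update handler. The core name (sentences), the example URL, and the field names (url_s, sentence_t) are my assumptions, not anything from your setup; the field names rely on the `*_s` and `*_t` dynamic fields that the stock schema usually ships with, so adjust them to whatever your schema defines.

```python
import json
import urllib.request

# Assumption: a local Solr core named "sentences"; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/sentences/update?commit=true"


def build_docs(page_url, sentences):
    """Turn a list of sentences into one Solr document per sentence."""
    return [
        # Hypothetical field names; they must exist in your schema.
        {"id": f"{page_url}#{i}", "url_s": page_url, "sentence_t": s}
        for i, s in enumerate(sentences)
    ]


def build_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) the POST request carrying the documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = build_docs(
    "http://example.com/page",
    ["One sentence.", "Another word here."],
)
request = build_request(docs)
# To actually send it to a running Solr: urllib.request.urlopen(request)
```

Each sentence gets its own id (page URL plus position), so a search hit is a sentence rather than a page, and the url_s field lets you trace it back to the page it came from.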
If you meant, ‘can I push a document, have Solr split it into sentences and index each one individually’, then the answer is, I think, not very easily inside Solr. If you are using Nutch, I’d recommend putting the splitting into Nutch so that it presents Solr with one sentence at a time.
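The splitting step itself can be quite small. This is only a rough sketch of the idea in Python, using a naive rule I picked for illustration (a sentence ends at '.', '!' or '?' followed by whitespace); inside Nutch you would have to do the equivalent in Java, and the function name here is hypothetical.

```python
import re

# Naive rule (an assumption for illustration): a sentence ends at
# '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


sentences = split_sentences("Solr indexes documents. Is one word enough? Yes!")
# sentences == ["Solr indexes documents.", "Is one word enough?", "Yes!"]
```

A rule this simple will trip over abbreviations such as "e.g." and over decimal numbers followed by a space, so for real crawled pages a proper sentence detector is probably worth the effort.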
Neither the analysis chain nor update request processors provide for splitting a document into smaller documents. You might also consider Elasticsearch as an alternative, though I have no concrete knowledge that it offers an easier route to your solution.