Quick question: is the Porter stemmer from the Lucene packages (Java) thread-safe?
I’m guessing the answer is no, as you need to set the current string, invoke the stem method, then call getCurrent to get the stemmed word. But perhaps I’m missing something. Is there a thread-safe method in Lucene for stemming a single word or string?
Does anyone know from experience whether it is faster to instantiate one PorterStemmer instance and run the setCurrent("..."); stem(); getCurrent(); routine inside a synchronized block on that instance, or to create a new PorterStemmer instance for each string/document you want to process?
In this case I have many thousands of documents, each of which is picked up by a thread from a pool (i.e. one thread per document).
Edit FYI – Example usage pattern:
import org.tartarus.snowball.ext.PorterStemmer;
...
private String stem(String word) {
    PorterStemmer stem = new PorterStemmer();
    stem.setCurrent(word);
    stem.stem();
    return stem.getCurrent();
}
Cheers!
Looking at the docs, it seems the PorterStemmer class is not re-entrant, so I’d build an instance per thread if I were you. If stemming is one of the main things your program does, and it has no other way of keeping your CPU cores busy, then a synchronized block seems like a bad idea: the program would be blocking all the time, waiting for the stemmer to finish one document. I wouldn’t create a single thread per document, either; a thread pool with one thread per core might be a wiser choice.
(No example code since I couldn’t even figure out the usage from the API docs. RTFS to find out how this thing works…)
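For what it's worth, here is a rough sketch of the "one instance per thread, one thread per core" idea, going by the setCurrent / stem / getCurrent pattern shown in the question. To keep it runnable without Lucene on the classpath, ToyStemmer is a made-up stand-in that only strips a trailing "s"; it is not the Porter algorithm. The class and method names (ToyStemmer, PerThreadStemming, stemAll) are my own, not part of any library. In real code you would presumably swap ToyStemmer for org.tartarus.snowball.ext.PorterStemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in with the same call shape as the stemmer in the
// question (setCurrent / stem / getCurrent). It is stateful, like the real
// one, but only strips a trailing "s". NOT the Porter algorithm.
class ToyStemmer {
    private String current = "";

    void setCurrent(String word) { current = word; }

    boolean stem() {
        if (current.length() > 1 && current.endsWith("s")) {
            current = current.substring(0, current.length() - 1);
            return true;
        }
        return false;
    }

    String getCurrent() { return current; }
}

class PerThreadStemming {
    // Each worker thread lazily creates and then reuses its own stemmer,
    // so the mutable state is never shared and no locking is needed.
    private static final ThreadLocal<ToyStemmer> STEMMER =
            ThreadLocal.withInitial(ToyStemmer::new);

    static String stem(String word) {
        ToyStemmer s = STEMMER.get();
        s.setCurrent(word);
        s.stem();
        return s.getCurrent();
    }

    // Stems every document (here: a list of words) on a fixed pool with
    // one thread per available core. Results keep the input order.
    static List<List<String>> stemAll(List<List<String>> documents) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> doc : documents) {
                futures.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>();
                    for (String w : doc) {
                        out.add(stem(w));
                    }
                    return out;
                }));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("stemming task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints [[cat, dog], [tree, bird]]
        System.out.println(stemAll(List.of(
                List.of("cats", "dogs"),
                List.of("tree", "birds"))));
    }
}
```

Compared with creating a new stemmer on every call, this creates at most one instance per pool thread. Whether that actually matters for throughput is something I'd measure rather than assume.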