I’m currently giving the user an option to include stop words or not when filtering a body of text for ngram frequencies. Typically, this is done as follows:
snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());
stopWords is set to either a full list of words to include in ngrams or to remove from them. this.getnGramLength() simply returns the current ngram length, up to a maximum of three.
If I keep stop words when filtering the text “satellite is definitely falling to Earth” for trigrams, the output is:
No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1
But if I remove stop words for trigrams, the output is this:
No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1
Why am I seeing underscores? I would have expected to see the simple unigrams plus “satellite falling”, “falling earth” and “satellite falling earth”. The word “definitely” is in the stop word set I’m using.
I can just filter out the results with underscores but…
The underscores represent a ‘missing stop word/s’. To avoid that behavior you should set enablePositionIncrements to false, but SnowballAnalyzer (now deprecated in 4.0.0-Beta) doesn’t let you do it.
One solution is using a StandardAnalyzer with no stop words first, and then decorate the output with a StopFilter, SnowballFilter, and ShingleFilter. Example for bi-grams in Lucene 4.0.0-Beta:
Hope that this puts you on the right track!
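To make the underscore behavior concrete, here is a minimal sketch in plain Java that only imitates the effect and does not call Lucene at all. The class and method names (ShingleGapSketch, removeStopWords, shingles) are made up for illustration, and the keepGaps flag is an assumed stand-in for what enablePositionIncrements does: when a removed stop word leaves a position gap, the shingler fills it with "_"; when the gap is closed, the remaining words become adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, library-free imitation of stop word removal followed by shingling.
class ShingleGapSketch {
    // Placeholder used where a removed stop word leaves a position gap.
    static final String FILLER = "_";

    // keepGaps = true mimics position increments being preserved:
    // each removed stop word leaves a filler in its place.
    // keepGaps = false mimics the gaps being closed.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords, boolean keepGaps) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                out.add(token);
            } else if (keepGaps) {
                out.add(FILLER);
            }
        }
        return out;
    }

    // Emits every n-gram of length 1..maxSize, in order of start position.
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("satellite", "is", "definitely", "falling", "to", "earth");
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "definitely", "to"));

        // Gaps kept: n-grams such as "satellite _ _" and "falling _ earth" appear.
        System.out.println(shingles(removeStopWords(tokens, stopWords, true), 3));

        // Gaps closed: prints
        // [satellite, satellite falling, satellite falling earth, falling, falling earth, earth]
        System.out.println(shingles(removeStopWords(tokens, stopWords, false), 3));
    }
}
```

With the gaps kept, the sketch produces the same keys as the output in the question, including "_" three times. With the gaps closed, it produces the n-grams the question expected.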
EDIT
Not possible anymore with Lucene v4.4+, still looking for a nice alternative…