Is there any chance that I could use Lucene’s ShingleAnalyzerWrapper to generate bigrams that take punctuation marks (i.e. ".", "," and ";") into account? Quick example: given the field “one two; three four”, it should produce only two bigrams: (one two) and (three four).
You could create a ShingleAnalyzerWrapper that uses an analyzer based on LetterTokenizer. LetterTokenizer breaks the input text at non-letters.

If you wanted better control over what you consider punctuation, you could subclass CharTokenizer and override the isTokenChar() method.
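To make the target output concrete, here is a minimal plain-Java sketch of the behaviour the question asks for. It deliberately uses no Lucene classes, so it is not the analyzer configuration itself. The class `PunctuationBigrams` and the helper `isBreak` are hypothetical names, and `isTokenChar` is only modelled on the CharTokenizer hook of the same name. The rule that whitespace separates tokens while any other non-token character ends the bigram window is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; no Lucene classes are used here.
final class PunctuationBigrams {

    // Hypothetical counterpart of CharTokenizer.isTokenChar():
    // letters and digits are treated as part of a token.
    static boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c);
    }

    // Assumption: whitespace only separates tokens; any other non-token
    // character (for example '.', ',' or ';') also ends the bigram window.
    static boolean isBreak(int c) {
        return !isTokenChar(c) && !Character.isWhitespace(c);
    }

    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String previous = null; // last complete token since the last break
        int i = 0;
        // Iterate one step past the end so the final token is flushed.
        while (i <= text.length()) {
            int c = i < text.length() ? text.codePointAt(i) : -1;
            if (c >= 0 && isTokenChar(c)) {
                current.appendCodePoint(c);
            } else {
                if (current.length() > 0) {
                    String token = current.toString();
                    if (previous != null) {
                        result.add(previous + " " + token);
                    }
                    previous = token;
                    current.setLength(0);
                }
                if (c >= 0 && isBreak(c)) {
                    previous = null; // do not pair tokens across punctuation
                }
            }
            i += (c >= 0) ? Character.charCount(c) : 1;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("one two; three four"));
    }
}
```

Whichever analyzer chain you end up with, it is worth comparing its output on a sample like this one, to confirm that no bigram spans the punctuation.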