My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for

Question

Editorial Team

Asked: May 14, 2026

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff like keeping domain names, email addresses or recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?

My current solution is to preprocess the tweet text before feeding it into the analyzer and replace the characters by other alphanumeric strings. For example,

String newText = text.replaceAll("#", "hashtag");  // text: the raw tweet string
newText = newText.replaceAll("@", "addresstag");

Unfortunately, this method breaks legitimate email addresses, but I can live with that. Does that approach make sense?
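One way the preprocessing idea might be narrowed so that email addresses survive is to rewrite "@" and "#" only when they start a term. The sketch below is plain Java with no Lucene dependency. The class name TweetPreprocessor, the method protectTags and the exact regular expressions are assumptions made for illustration, not part of the original approach.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrites @user and #hashtag markers only when they
// begin a term, so an address such as bob@example.com is left alone.
final class TweetPreprocessor {
    // '@' not preceded by a word character or a dot, followed by a user name.
    private static final Pattern MENTION = Pattern.compile("(?<![\\w.])@(\\w+)");
    // '#' not preceded by a word character, followed by a tag name.
    private static final Pattern HASHTAG = Pattern.compile("(?<!\\w)#(\\w+)");

    static String protectTags(String text) {
        String result = MENTION.matcher(text).replaceAll("addresstag$1");
        return HASHTAG.matcher(result).replaceAll("hashtag$1");
    }

    public static void main(String[] args) {
        System.out.println(protectTags("mail bob@example.com or ping @alice about #lucene"));
    }
}
```

The lookbehind is what keeps the email address intact: the "@" in bob@example.com follows a word character, so it is never matched. The markers still have to be mapped back when search results are displayed.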

Thanks in advance!

Amaç

1 Answer

Editorial Team · Answer 1 · 2026-05-14T04:12:11+00:00

StandardAnalyzer basically passes the tokens produced by StandardTokenizer through a StandardFilter (which cleans up the tokens, for example by removing 's from the ends of words), followed by a LowerCaseFilter (to lowercase your words) and finally a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.

What you could easily do to get started is implement your own analyzer that performs the same as the StandardAnalyzer but uses a WhitespaceTokenizer as the first item that processes the input stream.
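The shape of that chain can be illustrated without any Lucene classes. The sketch below is plain Java and only mimics the order of the steps (split on whitespace, clean up, lowercase, drop stop words); it is not Lucene's Analyzer API. The class name TweetAnalysisSketch, the short stop list and the trailing-punctuation cleanup are all assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a whitespace-based analysis chain.
final class TweetAnalysisSketch {
    // Small illustrative stop list; not Lucene's actual default set.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "as", "for", "in", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Step 1: split on whitespace only, so '@' and '#' stay attached.
        for (String token : text.trim().split("\\s+")) {
            // Step 2: lowercase each token.
            String term = token.toLowerCase(Locale.ROOT);
            // Step 3 (assumed cleanup): whitespace splitting leaves trailing
            // punctuation attached, so strip it, then strip a possessive 's.
            term = term.replaceAll("[.,;:!?]+$", "");
            if (term.endsWith("'s")) {
                term = term.substring(0, term.length() - 2);
            }
            // Step 4: drop empty tokens and stop words.
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Thanks @Alice for the #Lucene tip!"));
    }
}
```

The trade-off with splitting on whitespace is visible in step 3: punctuation that StandardTokenizer would have discarded stays glued to the tokens, so some cleanup of that kind is likely needed in a real analyzer as well.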

For more details on the inner workings of the analyzers, you can have a look over here
