We are currently using [^a-zA-Z0-9] in Java’s replaceAll function to strip special characters from

Question

Asked: June 17, 2026

We are currently using [^a-zA-Z0-9] in Java’s replaceAll function to strip special characters from a string. It has come to our attention that we need to allow hyphen(s) when they are mixed with number(s).

Examples for which hyphens will not be matched:

1-2-3
-1-23-4562
--1---2--3---4-
--9--a--7
425-12-3456

Examples for which hyphens will be matched:

--a--b--c
wal-mart

We think we have formulated a regex that meets the latter criteria, using this SO question as a reference, but we have no idea how to combine it with the original regex [^a-zA-Z0-9].

We want to do this to a Lucene search string because of the way Lucene’s standard tokenizer works when indexing:

Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T11:37:19+00:00

You can’t do this with a single regex. (Well… maybe in Perl.)

(edit: Okay, you can do it with variable-length negative lookbehind, which it appears Java can (almost uniquely!) do; see Cyborgx37’s answer. Regardless, imo, you shouldn’t do this with a single regex. :))
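For the curious, here is a rough sketch of what a single-regex version could look like in Java. It is not necessarily the regex from Cyborgx37’s answer; the class name SingleRegexStripper and the 100-character lookbehind bound are assumptions made for illustration. A hyphen is removed only when no digit can be reached from it, in either direction, through a run of letters and hyphens. Like the original replaceAll call, it also strips whitespace.

```java
import java.util.regex.Pattern;

class SingleRegexStripper {
    // Matches every character that should be removed:
    //  - anything that is not a letter, digit, or hyphen, or
    //  - a hyphen with no digit reachable behind it (bounded lookbehind,
    //    at most 100 letters/hyphens back) and no digit reachable ahead of it.
    private static final Pattern TO_STRIP = Pattern.compile(
            "[^-a-zA-Z0-9]|(?<![0-9][-a-zA-Z]{0,100})-(?![-a-zA-Z]*[0-9])");

    static String strip(String input) {
        return TO_STRIP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart"));   // walmart
        System.out.println(strip("--9--a--7"));  // --9--a--7
    }
}
```

The bound on the lookbehind means a hyphen more than 100 characters after the nearest digit in the same token would be stripped, which is one more reason to prefer the per-word approach below.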

What you can do is split the string into words and deal with each word individually. My Java is pretty terrible so here is some hopefully-sensible Python:

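import re

# Precompile some regexes
# Any digit in the token makes it a "product number" (e.g. 1-2-3, --9--a--7)
looks_like_product_number = re.compile(r'[0-9]')
not_wordlike = re.compile(r'[^a-zA-Z0-9]')
not_wordlike_or_hyphen = re.compile(r'[^-a-zA-Z0-9]')

string = 'wal-mart 1-2-3'  # example search string

# Split on anything that's not a letter, number, or hyphen -- BUT dots
# only split when followed by whitespace
words = re.split(r'(?:[^-.a-zA-Z0-9]|[.]\s)+', string)

stripped_words = []
for word in words:
    if '-' in word and not looks_like_product_number.search(word):
        # Hyphenated word with no digits; strip the hyphens too
        stripped_word = not_wordlike.sub('', word)
    else:
        # Product number (or no hyphen at all); allow dashes
        stripped_word = not_wordlike_or_hyphen.sub('', word)

    # Skip empty pieces left over from leading/trailing separators
    if stripped_word:
        stripped_words.append(stripped_word)

# Print the cleaned query; in real code, hand this string to Lucene instead
print(' '.join(stripped_words))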

When I run this with 'wal-mart 1-2-3', I get back 'walmart 1-2-3'.
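Since the question is about Java, the per-word approach might be ported roughly like this. Treat it as a sketch under assumptions rather than tested production code: the class name HyphenStripper and the method strip are made up for illustration, and a token is considered number-bearing when it contains any ASCII digit, following the Lucene rule quoted in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class HyphenStripper {
    // Any ASCII digit marks the token as a "product number" (assumption).
    private static final Pattern HAS_DIGIT = Pattern.compile("[0-9]");
    private static final Pattern NOT_WORDLIKE = Pattern.compile("[^a-zA-Z0-9]");
    private static final Pattern NOT_WORDLIKE_OR_HYPHEN = Pattern.compile("[^-a-zA-Z0-9]");
    // Split on runs of anything that is not a letter, digit, hyphen, or dot;
    // a dot only splits when it is followed by whitespace.
    private static final Pattern SEPARATOR = Pattern.compile("(?:[^-.a-zA-Z0-9]|[.]\\s)+");

    static String strip(String input) {
        List<String> cleaned = new ArrayList<>();
        for (String word : SEPARATOR.split(input)) {
            String stripped = HAS_DIGIT.matcher(word).find()
                    // Mixed with a number: keep hyphens
                    ? NOT_WORDLIKE_OR_HYPHEN.matcher(word).replaceAll("")
                    // No digits: strip hyphens along with everything else
                    : NOT_WORDLIKE.matcher(word).replaceAll("");
            if (!stripped.isEmpty()) {
                cleaned.add(stripped);
            }
        }
        return String.join(" ", cleaned);
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart 1-2-3")); // walmart 1-2-3
    }
}
```

Unlike a bare replaceAll, this keeps one space between words, so the result can still be handed to Lucene as a multi-term query.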

But honestly, the above code reproduces most of what the Lucene tokenizer is already doing. I think you’d be better off just copying StandardTokenizer into your own project and modifying it to do what you want.
