We are currently using [^a-zA-Z0-9] in Java’s replaceAll function to strip special characters from a string. It has come to our attention that we need to allow hyphen(s) when they are mixed with number(s).
Examples for which hyphens will not be matched:
- 1-2-3
- -1-23-4562
- –1—2–3—4-
- –9–a–7
- 425-12-3456
Examples for which hyphens will be matched:
- –a–b–c
- wal-mart
We think we formulated a regex to meet the latter criteria using this SO question as a reference, but we have no idea how to combine it with the original regex [^a-zA-Z0-9].
We want to do this to a Lucene search string because of the way Lucene’s standard tokenizer works when indexing:
Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
You can’t do this with a single regex. (Well… maybe in Perl.)
(edit: Okay, you can do it with variable-length negative lookbehind, which it appears Java can (almost uniquely!) do; see Cyborgx37’s answer. Regardless, imo, you shouldn’t do this with a single regex. :))
What you can do is split the string into words and deal with each word individually. My Java is pretty terrible so here is some hopefully-sensible Python:
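A minimal sketch of that per-word approach (not necessarily the exact code originally posted). It assumes words are separated by whitespace, that a word "contains a number" if it has any ASCII digit, and that only the ASCII hyphen `-` should be preserved; the function name `strip_special` is made up for illustration.

```python
import re

def strip_special(text):
    """Strip non-alphanumeric characters from each whitespace-separated word,
    keeping ASCII hyphens only in words that contain at least one digit."""
    cleaned = []
    for word in text.split():
        if re.search(r"[0-9]", word):
            # Word contains a digit: keep letters, digits, and hyphens.
            word = re.sub(r"[^a-zA-Z0-9-]", "", word)
        else:
            # No digit: strip everything except letters and digits.
            word = re.sub(r"[^a-zA-Z0-9]", "", word)
        if word:
            # Words made up entirely of special characters are dropped.
            cleaned.append(word)
    # Note: runs of whitespace are collapsed to single spaces.
    return " ".join(cleaned)

print(strip_special("wal-mart 1-2-3"))  # walmart 1-2-3
```

Typographic dashes (en and em dashes) are not treated as hyphens here and get stripped like any other special character; widen the character class if they should be kept.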
When I run this with 'wal-mart 1-2-3', I get back 'walmart 1-2-3'.
But honestly, the above code reproduces most of what the Lucene tokenizer is already doing. I think you’d be better off just copying StandardTokenizer into your own project and modifying it to do what you want.