I have a regex ([-@.\/,':\w]*[\w])* and it matches all words within a text (including punctuated words like I.B.M), but I want to make it exclude underscores and I can’t seem to figure out how to do it… I tried adding ^[_] (e.g. (^[_][-@.\/,':\w]*[\w])*) but it just breaks up all the words into letters. I want to preserve the word matching, but I don’t want to have words with underscores in them, nor words that are entirely made up of underscores.
What's the proper way to do this?
P.S.
- My app is written in C# (if that makes any difference).
- I can’t use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).
Update
Here is an example:
“I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий.”
The matches should be:
I.B.M
should
be
parsed
as
one
word
Russian
should
work
too
мплекс
исторических
событий
Note that w_o_r_d should not get matched.
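Try replacing each \w in your pattern with an explicit class such as [\p{L}\p{Mn}\p{Nd}], which leaves out underscores.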
The \w class is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching (or simply [a-zA-Z_0-9] if you're doing non-Unicode, ECMAScript-style matching). It's the \p{Pc} Unicode category (punctuation, connector) that causes the problem by matching underscores, so we explicitly match against the other categories without including that one. (Further information here, “Character Classes: Word Character”, and here, “Character Classes: Supported Unicode General Categories”.)
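For what it's worth, here is a minimal C# sketch of that idea, not a definitive implementation. Swapping \w for the explicit categories on its own would still pick up the fragments w, o, r and d from w_o_r_d, so this version adds two lookarounds (my addition, not part of the explanation above) that reject any run of word characters containing a connector such as the underscore. The names WordMatcher and ExtractWords are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WordMatcher
{
    // (?<![-@./,':\w])            start only at the beginning of a run of word/punctuation characters
    // (?![-@./,':\w]*\p{Pc})      skip the run entirely if it contains connector punctuation (e.g. '_')
    // [-@./,':\p{L}\p{Mn}\p{Nd}]* same set as the original pattern, with \w expanded minus \p{Pc}
    // [\p{L}\p{Mn}\p{Nd}]         the match must end on a letter, mark, or digit
    private static readonly Regex WordPattern = new Regex(
        @"(?<![-@./,':\w])(?![-@./,':\w]*\p{Pc})[-@./,':\p{L}\p{Mn}\p{Nd}]*[\p{L}\p{Mn}\p{Nd}]");

    // Returns every matched word in the order it appears in the text.
    public static string[] ExtractWords(string text)
    {
        return WordPattern.Matches(text)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();
    }
}
```

Calling WordMatcher.ExtractWords on the example sentence from the question should give I.B.M, the English words, and the three Russian words, and skip w_o_r_d altogether. If you would rather keep the pieces of an underscored word, drop the two lookarounds and keep only the character classes.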