I’m trying to make a dynamic regex that matches a person’s name. It works without problems on most names, but it fails when the name ends in an accented character.
Example: Some Fancy Namé
The regex I’ve used so far is:
/\b(Fancy Namé|Namé)\b/i
Used like this:
"Goal: Some Fancy Namé. Awesome.".replace(/\b(Fancy Namé|Namé)\b/i, '<a href="#">$1</a>');
This simply won’t match. If I replace the é with an e, it matches just fine.
If I try to match a name such as “Some Fancy Naméa”, it works just fine.
If I remove the last word boundary anchor, it works just fine.
Why doesn’t the word boundary anchor work here? Any suggestions on how I would get around this problem?
I have considered using something like this, but I’m not sure what the performance penalties would be like:
"Some fancy namé. Allow me to ellaborate.".replace(/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/g, '$1<a href="#">$2</a>$3')
Suggestions? Ideas?
JavaScript’s regex implementation is not Unicode-aware. It only knows the ‘word characters’ in standard low-byte ASCII, which does not include é or any other accented or non-English letters.
Because é is not a word character to JS, é followed by a space can never be considered a word boundary. (It would match \b if used in the middle of a word, like Namés.)
Yeah, the pattern you’re considering would be the usual workaround for JS (though probably with more punctuation characters). For other languages you’d generally use lookahead/lookbehind to avoid matching the pre and post boundary characters, but these are poorly supported/buggy in JS so best avoided.
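To make that concrete, here is a minimal sketch of the delimiter-matching workaround. It assumes a small punctuation set, and the names DELIM, escapeRegex and linkNames are made up for illustration, so adjust them to your data. It first shows that the \b version fails on the original string, then matches the delimiters explicitly and puts them back in the replacement.

```javascript
const text = "Goal: Some Fancy Namé. Awesome.";

// Without the u flag, \w means [A-Za-z0-9_], so "é" is a non-word character.
// "é" followed by "." has non-word characters on both sides: no boundary there.
const boundaryMatches = /\b(Fancy Namé|Namé)\b/i.test(text);

// Hypothetical delimiter set; extend it with whatever punctuation your text uses.
const DELIM = "[\\s.,!?;:()\"']";

// Escape regex metacharacters so names can be placed into a pattern safely.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap each listed name in a link, matching delimiters explicitly instead of \b.
// List longer names first so "Fancy Namé" wins over "Namé" at the same position.
function linkNames(input, names) {
  const alternatives = names.map(escapeRegex).join("|");
  const re = new RegExp(
    "(^|" + DELIM + ")(" + alternatives + ")(" + DELIM + "|$)", "gi");
  // $1 and $3 restore the delimiters that the pattern consumed.
  return input.replace(re, '$1<a href="#">$2</a>$3');
}

const linked = linkNames(text, ["Fancy Namé", "Namé"]);
console.log(boundaryMatches);
console.log(linked);
```

One caveat with this approach: the trailing delimiter is consumed by the match, so two names separated by a single delimiter will not both be linked in one pass.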