I need to write a regular expression so that I can replace the invalid characters in a user’s input before sending it further. I think I need to use `string.replaceAll("regex", "replacement")` to do that.
The particular line of code should replace all characters which are not Unicode letters, so it’s a whitelist of Unicode characters. Basically, it validates the user’s first name and replaces the invalid characters.
What I’ve found so far is this: `\p{L}\p{M}`, but I’m not sure how to use it in a regexp so that it works as I explained above. Would this be a regex negation case?
Yes, you need negation. The regular expression would be `[^\p{L}]` for anything except letters. Another way to write this would be `\P{L}`.
`\p{M}` means “all marks”, thus `[^\p{L}\p{M}]` means anything which is neither letter nor mark. This could also be written as `[\P{L}&&[\P{M}]]`, but this is not really better.
In a Java string literal every `\` has to be doubled, so you would write `string.replaceAll("[^\\p{L}\\p{M}]", "replacement")` there.
From a comment:
The category M (Mark) consists of the following subcategories:
Mn: Mark, Non-Spacing
An example for this is ̀ (U+0300). This is the COMBINING GRAVE ACCENT, which can be used together with a letter (the letter before it) to create accented characters. For the commonly used accented characters there is already a precomposed form (e.g. é), but for others there is not.
Mc: Mark, Spacing Combining
These are quite rare; I found them mainly in South Asian scripts and for musical notes. For example, there is U+1D165, MUSICAL SYMBOL COMBINING STEM, 𝅥, which can be combined with U+1D15D, MUSICAL SYMBOL WHOLE NOTE, 𝅝, to form something like 𝅝𝅥. (Hmm, the images do not look right here. I suppose my browser does not support these characters. Have a look at the code charts if they look wrong here.)
Me: Mark, Enclosing
These are marks which somehow enclose the base letter (the previous one, if I understand right). One example would be U+20DD, ⃝, which allows creating things like A⃝. (This should be rendered as an A enclosed by a circle, if I understand right. It does not, in my browser.) Another one would be U+20E3, ⃣, COMBINING ENCLOSING KEYCAP, which should give the look of a keycap with the letter on it (A⃣). (They do not show in my browser. Have a look at the code chart if you can’t see them.)
You can find them all by searching in UnicodeData.txt for `;Mn;`, `;Mc;` or `;Me;`, respectively. Some more information is in the FAQ: Characters and Combining Marks.
Do you need them? I’m not sure here. Most common names (at least in Latin alphabets) would use precomposed letters, I think. But the user might input them in decomposed form; I think on Mac OS X this is actually the default. In that case you would have to run the normalization algorithm before filtering away unknown characters. (Running the normalization seems a good idea anyway if you want to compare the names and not only show them on screen.)
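To make the normalize-then-filter idea concrete, here is a minimal sketch. It assumes NFC is the normalization form you want and that invalid characters are simply dropped (replaced by the empty string); the class name `NameFilter` and the helper `keepLettersAndMarks` are made up for this illustration.

```java
import java.text.Normalizer;

class NameFilter {
    // Hypothetical helper: compose the input to NFC first, then remove
    // every character that is neither a letter (L) nor a mark (M).
    static String keepLettersAndMarks(String input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        // Backslashes are doubled because this is a Java string literal.
        return normalized.replaceAll("[^\\p{L}\\p{M}]", "");
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (COMBINING ACUTE ACCENT) is the
        // decomposed form of the precomposed letter U+00E9.
        String decomposed = "Rene\u0301 42!";
        String cleaned = keepLettersAndMarks(decomposed);
        System.out.println(cleaned);
        // After NFC the accent is part of a single precomposed letter,
        // so four characters remain once the space, digits and "!" are gone.
        System.out.println(cleaned.length()); // 4
    }
}
```

Keeping `\p{M}` in the whitelist still matters after normalization, because letter and mark combinations without a precomposed form stay decomposed even in NFC.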
Edit: not directly relating to the question, but relating to the discussion in the comments:
I wrote a quick test program to show that `[^\pL\pM]` is not equivalent to `[\PL\PM]`:
Here is the output (with OpenJDK 1.6.0_20 on OpenSUSE):
We can see that:
1. `[^\pL\pM]` is not equivalent to `[\PL\PM]`.
2. `[\PL\PM]` really matches everything, but `[\PL\PM]` is not equal to `.`, since `.` does not match `\n` and `\r`.
The second point is caused by the fact that `[\PL\PM]` is the union of `\PL` and `\PM`: `\PL` contains characters from all categories other than L (including M), and `\PM` contains characters from all categories other than M (including L), so together they contain the whole character repertoire. `[^\pL\pM]`, on the other hand, is the complement of the union of `\pL` and `\pM`, which is equivalent to the intersection of `\PL` and `\PM`.
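A comparison of this kind can be sketched as follows. This is not the original test program, just a small stand-in; the class name `ClassComparison`, the helper `matches` and the four sample strings are chosen for illustration.

```java
import java.util.regex.Pattern;

class ClassComparison {
    // Returns true if the whole string s matches the regex.
    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        String[] regexes = { "[^\\pL\\pM]", "[\\PL\\PM]", "." };
        // Samples: a letter, a combining mark (U+0300), a digit, a line feed.
        String[] samples = { "a", "\u0300", "1", "\n" };
        for (String regex : regexes) {
            StringBuilder row = new StringBuilder(regex).append(':');
            for (String s : samples) {
                row.append(' ').append(matches(regex, s));
            }
            System.out.println(row);
        }
    }
}
```

With these samples, `[^\pL\pM]` should accept only the digit and the line feed, `[\PL\PM]` should accept all four, and `.` should accept everything except the line feed, which is the difference described above.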