I’m interested in language-specific validators via regex. I know that I can validate a person’s name, in any language, with a pattern like this:
"[\p{L}\p{M}]"
But what if I want validation to be for a specific language? It would be nice if my thread’s CurrentUICulture or CurrentCulture setting would simply convert the meaning of "[\w]" to something appropriate for German, Spanish, English, and especially Chinese. Does it work that way? If yes, then this is likely my answer.
If not, then my next interest would be to use a regex script annotation. However, I notice that:
- The list given in that link does not include simplified “Chinese”, which I am particularly interested in.
- I don’t think .NET regex capabilities support script-based matching. Yes? No?
So my final option, if I can’t get the prior two options to work, is to turn to named blocks. At least the list of .NET-supported named blocks includes several entries for CJK. I suppose I can simply combine the several CJK blocks and call that (simplified) “Chinese.”
Thoughts?
I have concluded that, in .NET, there is no such thing as a regex whose character classes are sensitive to CurrentUICulture or CurrentCulture. The current culture influences only case-insensitive matching (RegexOptions.IgnoreCase), not the meaning of a class such as "[\w]". I have also concluded that the most permissive reasonable approach is a single validation, applicable to all languages simultaneously, that simply rejects all forms of non-printable characters, “dingbats”, angle brackets (to prevent markup injection), and math symbols:
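As a minimal sketch of such a deny-list (not a definitive pattern), the following C# helper rejects the categories listed above. The class name `PermissiveNameValidator` and the exact choice of categories are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the name and the category choices are illustrative assumptions.
public static class PermissiveNameValidator
{
    // \p{C}          control, format, surrogate, private-use, and unassigned characters
    // \p{S}          all symbols: math, currency, modifier, and "other" symbols
    // \p{IsDingbats} the Dingbats block, including entries that are not in \p{S}
    // < >            listed explicitly for emphasis; both are already math symbols (Sm)
    // \A and \z anchor the whole string; unlike $, \z does not tolerate a trailing newline.
    private static readonly Regex Pattern = new Regex(
        @"\A[^\p{C}\p{S}\p{IsDingbats}<>]+\z",
        RegexOptions.CultureInvariant);

    // Returns true only for a non-empty string with no rejected character.
    public static bool IsValid(string input) =>
        input != null && Pattern.IsMatch(input);
}
```

With this sketch, `PermissiveNameValidator.IsValid("O’Toole")` passes, while `"<b>Bob</b>"` and `"2+2"` are rejected. Because .NET regexes work on UTF-16 code units and surrogates fall under `\p{C}`, it also rejects characters outside the Basic Multilingual Plane, such as emoji and rare ideographs.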
The mid-permissive approach is to use a pattern that explicitly allows both Western and Eastern character sets (including diacritics and “combining characters”):
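One possible allow-list of this kind is sketched below. It assumes that “Western” means Latin letters up to U+024F and that “Eastern” means the CJK ideograph blocks. The class name `WesternAndCjkNameValidator`, those ranges, and the extra separators (space, hyphen, apostrophes) are all illustrative assumptions. The CJK blocks do not distinguish simplified from traditional characters, so this accepts both.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the ranges and block selection are illustrative assumptions.
public static class WesternAndCjkNameValidator
{
    // ASCII letters plus the letters of Latin-1 Supplement and Latin Extended-A/B
    // (U+00D7 and U+00F7, the multiplication and division signs, are skipped).
    private const string LatinLetters =
        @"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F";

    // .NET named blocks for Han ideographs.
    private const string CjkBlocks =
        @"\p{IsCJKUnifiedIdeographs}\p{IsCJKUnifiedIdeographsExtensionA}" +
        @"\p{IsCJKCompatibilityIdeographs}";

    // Combining diacritics, space, ASCII and typographic apostrophes, and hyphen.
    private const string Extras =
        @"\p{IsCombiningDiacriticalMarks} '\u2019\-";

    private static readonly Regex Pattern = new Regex(
        @"\A[" + LatinLetters + CjkBlocks + Extras + @"]+\z",
        RegexOptions.CultureInvariant);

    // Returns true only if every character is in one of the allowed sets.
    public static bool IsValid(string input) =>
        input != null && Pattern.IsMatch(input);
}
```

Under these assumptions, names such as "José María" and "王小明" are accepted, while Cyrillic text, digits, and markup characters are rejected.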
The least-permissive approach, if I want only Western characters, is this:
The above still allows all forms of quote marks, which are needed for names like O’Toole.
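A Western-only allow-list with the quote marks kept could look like the sketch below. The class name `WesternNameValidator`, the Latin range (up to U+024F), and the reading of “quote marks” as ASCII quotes plus the Unicode initial and final quotation categories are assumptions for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the ranges and the quote handling are illustrative assumptions.
public static class WesternNameValidator
{
    // A-Za-z and U+00C0..U+024F    Latin letters (U+00D7 and U+00F7 skipped)
    // \p{IsCombiningDiacriticalMarks}  combining accents
    // \p{Pi} \p{Pf}                initial and final quotation punctuation, e.g. ‘ ’ “ ” « »
    // ' and "                      ASCII quotes (written as "" inside a verbatim string)
    // space and hyphen             separators for multi-part names
    private static readonly Regex Pattern = new Regex(
        @"\A[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F" +
        @"\p{IsCombiningDiacriticalMarks}\p{Pi}\p{Pf}'"" \-]+\z",
        RegexOptions.CultureInvariant);

    // Returns true only if every character is a Western letter, mark, quote, space, or hyphen.
    public static bool IsValid(string input) =>
        input != null && Pattern.IsMatch(input);
}
```

Here `WesternNameValidator.IsValid("O’Toole")` and `IsValid("O'Toole")` both pass, while CJK text, control characters, and angle brackets are rejected.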