I’m developing software in Portuguese, so many of my entities have names like ‘maça’ or ‘lição’, and I want to use the entity name as a resource key. So I want to keep every character except ‘ç’, ‘ã’, ‘õ’, …
Is there an optimal solution using regex? My current regex is (as "Remove characters using Regex" suggests):
Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();
Just to emphasize: I’m only concerned with Latin characters.
A simple option is to white-list the accepted characters:
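The white-list idea could look something like the sketch below. The class name ResourceKeys, the method name ToKey, and the exact set of accepted characters (ASCII letters and digits) are assumptions made for illustration; adjust the character class to whatever your keys should allow.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the original post.
public static class ResourceKeys
{
    // Removes every character that is NOT in the white-list
    // (assumed here: ASCII letters and digits), then upper-cases the rest.
    public static string ToKey(string messyText)
    {
        string kept = Regex.Replace(messyText, "[^a-zA-Z0-9]+", "");
        return kept.ToUpperInvariant();
    }
}
```

With this sketch, ResourceKeys.ToKey("lição") gives "LIO": the ç and ã are dropped rather than converted to c and a.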
If you want to remove all non-ASCII letters but keep all other characters, you can use character class subtraction:
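A minimal sketch of character class subtraction in .NET, assuming the input uses precomposed characters (for example ç as a single code point). The class name NonAsciiLetters and the method name Strip are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the original post.
public static class NonAsciiLetters
{
    // [\p{L}-[a-zA-Z]] means "any Unicode letter, minus the ASCII letters",
    // so only non-ASCII letters are matched and removed.
    public static string Strip(string messyText)
    {
        return Regex.Replace(messyText, @"[\p{L}-[a-zA-Z]]+", "");
    }
}
```

Digits, spaces and punctuation are left alone: NonAsciiLetters.Strip("lição 1!") gives "lio 1!".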
The subtraction can also be written in the more standard, but more complicated, form [^\P{L}a-zA-Z]+ (or \W), which reads "match every character that is neither a non-letter nor an ASCII letter", which ends up with the letters we’re looking for.

Just some context for \W: it stands for "not a word character". In many regex flavors that means anything other than a-z, A-Z, 0-9 and the underscore _. Note that in .NET \w is Unicode-aware by default, so accented letters such as ç also count as word characters unless RegexOptions.ECMAScript is used.

You may also find the following approach more useful: How do I remove diacritics (accents) from a string in .NET?
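For reference, the diacritics-removal approach is commonly sketched as below: decompose the string, drop the combining marks, and recompose. This is an illustration under that assumption, not necessarily the exact code from the linked question; the class name Diacritics and the method name Remove are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not taken from the original post.
public static class Diacritics
{
    public static string Remove(string text)
    {
        // FormD splits 'ç' into 'c' plus a combining cedilla, and so on.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Unlike the regex options, this keeps the base letters: Diacritics.Remove("lição") gives "licao", which may make a more readable resource key.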