I have a routine that needs to be supplied with normalized strings. However, the data that’s coming in isn’t necessarily clean, and String.Normalize() raises ArgumentException if the string contains invalid code points.
What I’d like to do is just replace those code points with a throwaway character such as ‘?’. But to do that I need an efficient way to search through the string to find them in the first place. What is a good way to do that?
The following code works, but it’s basically using try/catch as a crude if-statement so performance is terrible. I’m just sharing it to illustrate the behavior I’m looking for:
    private static string ReplaceInvalidCodePoints(string aString, string replacement)
    {
        var builder = new StringBuilder(aString.Length);
        var enumerator = StringInfo.GetTextElementEnumerator(aString);
        while (enumerator.MoveNext())
        {
            string nextElement;
            try { nextElement = enumerator.GetTextElement().Normalize(); }
            catch (ArgumentException) { nextElement = replacement; }
            builder.Append(nextElement);
        }
        return builder.ToString();
    }
(edit:) I’m thinking of converting the text to UTF-32 so that I can quickly iterate over it and check whether each dword corresponds to a valid code point. Is there a function that will do that? If not, is there a list of invalid ranges floating around out there?
It seems like the only way to do it is ‘manually’ like you’ve done. Here’s a version that gives the same results as yours, but is a bit faster (about 4 times over a string of all chars up to char.MaxValue, less improvement up to U+10FFFF) and doesn’t require unsafe code. I’ve also simplified and commented my IsCharacter method to explain each selection:
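As an illustration of this kind of manual scan (a minimal sketch, not the answer's original code), the version below walks the string one UTF-16 unit at a time and replaces anything it considers invalid. It assumes the code points that make Normalize() throw are unpaired surrogates and Unicode noncharacters (U+FDD0 to U+FDEF, plus the last two code points of every plane); that assumption may not match every runtime. The class name CodePointCleaner and the helper IsNoncharacter are hypothetical names chosen for the example.

```csharp
using System;
using System.Text;

public static class CodePointCleaner
{
    // Hypothetical helper: true for Unicode noncharacters, i.e.
    // U+FDD0..U+FDEF and any code point whose low 16 bits are FFFE or FFFF.
    public static bool IsNoncharacter(int codePoint)
    {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Replaces unpaired surrogates and noncharacters with 'replacement'.
    // Assumption: these are the code points that Normalize() rejects.
    public static string ReplaceInvalidCodePoints(string input, string replacement)
    {
        var builder = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c)
                && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // Well-formed surrogate pair: check the combined code point.
                int codePoint = char.ConvertToUtf32(c, input[i + 1]);
                if (IsNoncharacter(codePoint))
                    builder.Append(replacement);
                else
                    builder.Append(c).Append(input[i + 1]);
                i++; // skip the low surrogate that was just consumed
            }
            else if (char.IsSurrogate(c) || IsNoncharacter(c))
            {
                // Lone surrogate, or a noncharacter in the BMP.
                builder.Append(replacement);
            }
            else
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

Unlike the try/catch version in the question, this sketch replaces individual code points rather than whole text elements, and it does not normalize anything itself; the caller would still call Normalize() on the cleaned result.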