I have a routine that needs to be supplied with normalized strings. However, the

Question

Asked: May 28, 2026 (2026-05-28T02:56:22+00:00)

I have a routine that needs to be supplied with normalized strings. However, the data that’s coming in isn’t necessarily clean, and String.Normalize() raises ArgumentException if the string contains invalid code points.

What I’d like to do is just replace those code points with a throwaway character such as ‘?’. But to do that I need an efficient way to search through the string to find them in the first place. What is a good way to do that?

The following code works, but it’s basically using try/catch as a crude if-statement so performance is terrible. I’m just sharing it to illustrate the behavior I’m looking for:

private static string ReplaceInvalidCodePoints(string aString, string replacement)
{
    var builder = new StringBuilder(aString.Length);
    var enumerator = StringInfo.GetTextElementEnumerator(aString);

    while (enumerator.MoveNext())
    {
        string nextElement;
        try { nextElement = enumerator.GetTextElement().Normalize(); }
        catch (ArgumentException) { nextElement = replacement; }
        builder.Append(nextElement);
    }

    return builder.ToString();
}

(edit:) I’m thinking of converting the text to UTF-32 so that I could quickly iterate over it and check whether each dword corresponds to a valid code point. Is there a function that will do that? If not, is there a list of invalid ranges floating around out there?

1 Answer

Editorial Team · Answer 1 · 2026-05-28T02:56:23+00:00

It seems like the only way to do it is ‘manually’, as you’ve done. Here’s a version that gives the same results as yours but is faster (about 4 times faster over a string of all chars up to char.MaxValue, with less improvement for code points up to U+10FFFF) and doesn’t require unsafe code. I’ve also simplified and commented my IsCharacter method to explain each condition:

static string ReplaceNonCharacters(string aString, char replacement)
{
    var sb = new StringBuilder(aString.Length);
    for (var i = 0; i < aString.Length; i++)
    {
        if (char.IsSurrogatePair(aString, i))
        {
            int c = char.ConvertToUtf32(aString, i);
            i++;
            if (IsCharacter(c))
                sb.Append(char.ConvertFromUtf32(c));
            else
                sb.Append(replacement);
        }
        else
        {
            char c = aString[i];
            if (IsCharacter(c))
                sb.Append(c);
            else
                sb.Append(replacement);
        }
    }
    return sb.ToString();
}

static bool IsCharacter(int point)
{
    return point < 0xFDD0 || // everything below here is fine
        point > 0xFDEF &&    // exclude the 0xFDD0...0xFDEF non-characters
        (point & 0xFFFE) != 0xFFFE; // exclude all other non-characters (the last two code points of each plane)
}
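
One thing to watch for: IsCharacter treats everything below 0xFDD0 as fine, and that includes unpaired surrogates (0xD800 to 0xDFFF), which the loop passes through unchanged. If unpaired surrogates also make String.Normalize() throw in your environment, a variant along these lines might help. This is only a sketch, not the answer's original method: the class name CodePointSanitizer and the methods ReplaceInvalid and IsNonCharacter are made up for illustration, and it assumes you want lone surrogates replaced in the same way as non-characters.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
static class CodePointSanitizer
{
    // True for Unicode non-characters: U+FDD0..U+FDEF, plus the last two
    // code points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
    static bool IsNonCharacter(int point)
    {
        return (point >= 0xFDD0 && point <= 0xFDEF) || (point & 0xFFFE) == 0xFFFE;
    }

    // Assumption: lone surrogates should be replaced as well as non-characters.
    public static string ReplaceInvalid(string input, char replacement)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsSurrogatePair(input, i))
            {
                int point = char.ConvertToUtf32(input, i);
                i++; // step past the low surrogate of the pair
                if (IsNonCharacter(point))
                    sb.Append(replacement);
                else
                    sb.Append(c).Append(input[i]); // copy both halves unchanged
            }
            else if (char.IsSurrogate(c) || IsNonCharacter(c))
            {
                // A surrogate that is not part of a valid pair, or a BMP non-character.
                sb.Append(replacement);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

You would then call something like CodePointSanitizer.ReplaceInvalid(input, '?').Normalize(). Whether that removes every input Normalize() rejects may depend on the runtime and platform, so it is worth checking against your own data.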
