EDIT Apologies if the original unedited question is misleading. This question is not asking

Question

0

Editorial Team

Asked: June 2, 20262026-06-02T05:46:34+00:00 2026-06-02T05:46:34+00:00

EDIT Apologies if the original unedited question is misleading. This question is not asking

0

EDIT

Apologies if the original unedited question is misleading.

This question is not asking how to remove Invalid XML Chars from a string, answers to that question would be better directed here.

I’m not asking you to review my code.

What I’m looking for in answers is, a function with the signature

string <YourName>(string input, Func<char, bool> check);

that will have performance similar or better than RemoveCharsBufferCopyBlackList. Ideally this function would be more generic and if possible simpler to read, but these requirements are secondary.

I recently wrote a function to strip invalid XML chars from a string. In my application the strings can be modestly long and the invalid chars occur rarely. This excerise got me thinking. What ways can this be done in safe managed c# and, which would offer the best performance for my scenario.

Here is my test program, I’ve subtituted the “valid XML predicate” for one the omits the char 'X'.

class Program
{
    static void Main()
    {
        var attempts = new List<Func<string, Func<char, bool>, string>>
            {
                RemoveCharsLinqWhiteList,
                RemoveCharsFindAllWhiteList,
                RemoveCharsBufferCopyBlackList
            }

        const string GoodString = "1234567890abcdefgabcedefg";
        const string BadString = "1234567890abcdefgXabcedefg";
        const int Iterations = 100000;
        var timer = new StopWatch();

        var testSet = new List<string>(Iterations);
        for (var i = 0; i < Iterations; i++)
        {
            if (i % 1000 == 0)
            {
                testSet.Add(BadString);
            }
            else
            {
                testSet.Add(GoodString);
            }
        }

        foreach (var attempt in attempts)
        {
            //Check function works and JIT
            if (attempt.Invoke(BadString, IsNotUpperX) != GoodString)
            {
                throw new ApplicationException("Broken Function");       
            }

            if (attempt.Invoke(GoodString, IsNotUpperX) != GoodString)
            {
                throw new ApplicationException("Broken Function");       
            }

            timer.Reset();
            timer.Start();
            foreach (var t in testSet)
            {
                attempt.Invoke(t, IsNotUpperX);
            }

            timer.Stop();
            Console.WriteLine(
                "{0} iterations of function \"{1}\" performed in {2}ms",
                Iterations,
                attempt.Method,
                timer.ElapsedMilliseconds);
            Console.WriteLine();
        }

        Console.Readkey();
    }

    private static bool IsNotUpperX(char value)
    {
        return value != 'X';
    }

    private static string RemoveCharsLinqWhiteList(string input,
                                                      Func<char, bool> check);
    {
        return new string(input.Where(check).ToArray());
    }

    private static string RemoveCharsFindAllWhiteList(string input,
                                                      Func<char, bool> check);
    {
        return new string(Array.FindAll(input.ToCharArray(), check.Invoke));
    }

    private static string RemoveCharsBufferCopyBlackList(string input,
                                                      Func<char, bool> check);
    {
        char[] inputArray = null;
        char[] outputBuffer = null;

        var blackCount = 0;
        var lastb = -1;
        var whitePos = 0;

        for (var b = 0; b , input.Length; b++)
        {
            if (!check.invoke(input[b]))
            {
                var whites = b - lastb - 1;
                if (whites > 0)
                {
                    if (outputBuffer == null)
                    {
                        outputBuffer = new char[input.Length - blackCount];
                    }

                    if (inputArray == null)
                    {
                        inputArray = input.ToCharArray();
                    }

                    Buffer.BlockCopy(
                                      inputArray,
                                      (lastb + 1) * 2,
                                      outputBuffer,
                                      whitePos * 2,
                                      whites * 2);
                    whitePos += whites; 
                }

                lastb = b;
                blackCount++;
            }
        }

        if (blackCount == 0)
        {
            return input;
        }

        var remaining = inputArray.Length - 1 - lastb;
        if (remaining > 0)
        {
            Buffer.BlockCopy(
                              inputArray,
                              (lastb + 1) * 2,
                              outputBuffer,
                              whitePos * 2,
                              remaining * 2);

        }

        return new string(outputBuffer, 0, inputArray.Length - blackCount);
    }        
}

If you run the attempts you’ll note that the performance improves as the functions get more specialised. Is there a faster and more generic way to perform this operation? Or if there is no generic option is there a way that is just faster?

Please note that I am not actually interested in removing ‘X’ and in practice the predicate is more complicated.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-02T05:46:35+00:00

You certainly don’t want to use LINQ to Objects aka enumerators to do this if you require high performance. Also, don’t invoke a delegate per char. Delegate invocations are costly compared to the actual operation you are doing.

RemoveCharsBufferCopyBlackList looks good (except for the delegate call per character).

I recommend that you inline the contents of the delegate hard-coded. Play around with different ways to write the condition. You may get better performance by first checking the current char against a range of known good chars (e.g. 0x20-0xFF) and if it matches let it through. This test will pass almost always so you can save the expensive checks against individual characters which are invalid in XML.

Edit: I just remembered I solved this problem a while ago:

    static readonly string invalidXmlChars =
        Enumerable.Range(0, 0x20)
        .Where(i => !(i == '\u000A' || i == '\u000D' || i == '\u0009'))
        .Select(i => (char)i)
        .ConcatToString()
        + "\uFFFE\uFFFF";
    public static string RemoveInvalidXmlChars(string str)
    {
        return RemoveInvalidXmlChars(str, false);
    }
    internal static string RemoveInvalidXmlChars(string str, bool forceRemoveSurrogates)
    {
        if (str == null) throw new ArgumentNullException("str");
        if (!ContainsInvalidXmlChars(str, forceRemoveSurrogates))
            return str;

        str = str.RemoveCharset(invalidXmlChars);
        if (forceRemoveSurrogates)
        {
            for (int i = 0; i < str.Length; i++)
            {
                if (IsSurrogate(str[i]))
                {
                    str = str.Where(c => !IsSurrogate(c)).ConcatToString();
                    break;
                }
            }
        }

        return str;
    }
    static bool IsSurrogate(char c)
    {
        return c >= 0xD800 && c < 0xE000;
    }
    internal static bool ContainsInvalidXmlChars(string str)
    {
        return ContainsInvalidXmlChars(str, false);
    }
    public static bool ContainsInvalidXmlChars(string str, bool forceRemoveSurrogates)
    {
        if (str == null) throw new ArgumentNullException("str");
        for (int i = 0; i < str.Length; i++)
        {
            if (str[i] < 0x20 && !(str[i] == '\u000A' || str[i] == '\u000D' || str[i] == '\u0009'))
                return true;
            if (str[i] >= 0xD800)
            {
                if (forceRemoveSurrogates && str[i] < 0xE000)
                    return true;
                if ((str[i] == '\uFFFE' || str[i] == '\uFFFF'))
                    return true;
            }
        }
        return false;
    }

Notice, that RemoveInvalidXmlChars first invokes ContainsInvalidXmlChars to save the string allocation. Most strings do not contain invalid XML chars so we can be optimistic.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

EDIT Apologies if the original unedited question is misleading. This question is not asking

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply