Don’t ask how I got there, but I was playing around with some masking,

Question

0

Asked: May 25, 20262026-05-25T00:53:46+00:00 2026-05-25T00:53:46+00:00

Don’t ask how I got there, but I was playing around with some masking,

0

Don’t ask how I got there, but I was playing around with some masking, loop unrolling etc. In any case, out of interest I was thinking about how I would implement an indexof method, and long story short, all that masking etc aside, this naive implementation:

public static unsafe int IndexOf16(string s, int startIndex, char c) {
            if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
            fixed (char* cs = s) {
                for (int i = startIndex; i < s.Length; i++) {
                    if ((cs[i]) == c) return i;
                }
                return -1;
            }
        }

is faster than string.IndexOf(char). I wrote some simple tests, and it seems to match output exactly.
Some sample output numbers from my machine (it varies to some degree of course, but the trend is clear):

short haystack 500k runs
1741 ms for IndexOf16
2737 ms for IndexOf32
2963 ms for IndexOf64
2337 ms for string.IndexOf <-- buildin

longer haystack:
2888 ms for IndexOf16
3028 ms for IndexOf32
2816 ms for IndexOf64
3353 ms for string.IndexOf <-- buildin

IndexOfChar is marked extern, so you cant reflector it. However I think this should be the (native) implementation:
http://www.koders.com/cpp/fidAB4768BA4DF45482A7A2AA6F39DE9C272B25B8FE.aspx?s=IndexOfChar#L1000

They seem to use the same naive implementation.

Questions come to my mind:

1) Am I missing something in my implementation that explains why its faster? I can only think of extended chars support, but their implementation suggests they don’t do anything special for that either.

2) I assumed much of the low level methods would ultimately be implemented in hand assembler, that seems not the case. If so, why implement it natively at all, instead of just in C# like my sample implementation?

(Complete test here (I think its too long to paste here): http://paste2.org/p/1606018 )

(No this is not premature optimization, it’s not for a project I am just messing about) 🙂

Update: Thnx to Oliver for the hint about nullcheck and the Count param. I have added these to my IndexOf16Implementation like so:

public static unsafe int IndexOf16(string s, int startIndex, char c, int count = -1) {
    if (s == null) throw new ArgumentNullException("s");
    if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
    if (count == -1) count = s.Length - startIndex;
    if (count < 0 || count > s.Length - startIndex) throw new ArgumentOutOfRangeException("count");

    int endIndex = startIndex + count;
    fixed (char* cs = s) {
        for (int i = startIndex; i < endIndex; i++) {
            if ((cs[i]) == c) return i;
        }
        return -1;
    }
}

The numbers changed slightly, however it is still quite significantly faster (32/64 results omitted):

short haystack 500k runs
1908 ms for IndexOf16
2361 ms for string.IndexOf
longer haystack:
3061 ms for IndexOf16
3391 ms for string.IndexOf

Update2: This version is faster yet (especially for the long haystack case):

public static unsafe int IndexOf16(string s, int startIndex, char c, int count = -1) {
            if (s == null) throw new ArgumentNullException("s");
            if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
            if (count == -1) count = s.Length - startIndex;
            if (count < 0 || count > s.Length - startIndex) throw new ArgumentOutOfRangeException("count");

            int endIndex = startIndex + count;
            fixed (char* cs = s) {
                char* cp = cs + startIndex;
                for (int i = startIndex; i <= endIndex; i++, cp++) {
                    if (*cp == c) return i;
                }
                return -1;
            }
        }

Update 4:
Based on the discussion with LastCoder I believe this to be architecture depended. My Xeon W3550 at works seems to prefer this version, while his i7 seems to like the buildin version. My home machine (Athlon II) appears to be in between. I am surprised about the large difference though.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-25T00:53:47+00:00

Possibility 1)
This may not hold (as true) in C# but when I did optimization work for x86-64 assembler I quickly found out while benchmarking that calling code from a DLL (marked external) was slower than implementing the same exact function within my executable. The most obvious reason is paging and memory, the DLL (external) method is loaded far away in memory from the rest of the running code and if it wasn’t accessed previously it’ll need to be paged in. Your benchmarking code should do some warm up loops of the functions you are benchmarking to make sure they are paged in memory first before you time them.

Possibility 2)
Microsoft tends not to optimize string functions to the fullest, so out optimizing a native string length, substring, indexof etc. isn’t really unheard of. Anecdote; in x86-64 assembler I was able to create a version of WinXP64’s RtlInitUnicodeString function that ran 2x faster in almost all practical use cases.

Possibility 3) Your benchmarking code shows that you’re using the 2 parameter overload for IndexOf, this function likely calls the 3 parameter overload IndexOf(Char, Int32, Int32) which adds an extra overhead to each iteration.

This may be even faster because your removing the i variable increment per iteration.

            char* cp = cs + startIndex;
            char* cpEnd = cp + endIndex;
            while (cp <= cpEnd) {
                if (*cp == c) return cp - cs;
                cp++;
            }

edit In reply regarding (2) for your curiosity, coded back in 2005 and used to patch the ntdll.dll of my WinXP64 machine. http://board.flatassembler.net/topic.php?t=4467

RtlInitUnicodeString_Opt: ;;rcx=buff rdx=ucharstr 77bytes
             xor    r9d,r9d
             test   rdx,rdx
             mov    dword[rcx],r9d
             mov    [rcx+8],rdx
             jz     .end
             mov    r8,rdx
   .scan:
             mov    eax,dword[rdx]

             test   ax,ax
             jz     .one
             add    rdx,4
             shr    eax,16
             test   ax,ax
             jz     .two
             jmp    .scan
   .two:
             add    rdx,2
   .one:
             mov    eax,0fffch
             sub    rdx,r8
             cmp    rdx,0fffeh
             cmovnb rdx,rax
             mov    [ecx],dx
             add    dx,2
             mov    [ecx+2],dx
             ret
   .end:
             retn

edit 2 Running your example code (updated with your fastest version) the string.IndexOf runs faster on my Intel i7, 4GB RAM, Win7 64bit.

short haystack 500k runs
2590 ms for IndexOf16
2287 ms for string.IndexOf
longer haystack:
3549 ms for IndexOf16
2757 ms for string.IndexOf

Optimizations are sometimes very architecture reliant.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Don’t ask how I got there, but I was playing around with some masking,

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply