I wrote part of a program that does some heavy work with strings in C#. I initially chose C# not only because it was easier to use .NET’s data structures, but also because I need to use this program to analyse some 2-3 million text records in a database, and it is much easier to connect to databases using C#.
One part of the program was slowing everything down, so I rewrote it in C, using pointers to access each character in the string. The section that took about 119 seconds to analyse 10,000,000 strings in C# now takes only 5 seconds in C! Performance is a priority, so I am considering rewriting the whole program in C, compiling it into a DLL (something I didn’t know how to do when I started writing the program) and using DllImport from C# to call its methods on the database strings.
Rewriting the whole program will take some time, and using DllImport to work with C#’s strings requires marshalling and the like. So my question is: will the performance gain from the C DLL’s faster string handling outweigh the cost of repeatedly marshalling strings every time C# calls into the C DLL?
First, profile your code. You might find some real headsmacker that speeds the C# code up greatly.
Second, writing the code in C using pointers is not really a fair comparison. If you are going to use pointers, why not write it in assembly language and get real performance? (Not really, just reductio ad absurdum.) A better comparison for native code would be to use std::string. That way you still get a lot of help from the string class and C++ exception-safety.
Given that you have to read 2-3 million records from the DB to do this work, I very much doubt that the time spent cracking the strings is going to outweigh the elapsed time taken to load the data from the DB. So, consider instead how to structure your code so that you can begin string processing while the DB load is in progress.
If you use a SqlDataReader (say) to load the rows sequentially, it should be possible to batch up N rows as fast as possible and hand them off to a separate thread for the post-processing that is your current headache and reason for this question. If you are on .NET 4.0, this is simplest to do using the Task Parallel Library, and System.Collections.Concurrent could also be useful for collation of results between the threads.
This approach should mean that neither the DB latency nor the string processing is a show-stopping bottleneck, because they happen in parallel. This applies even if you are on a single-processor machine, because your app can process strings while it’s waiting for the next batch of data to come back from the DB over the network. If you find string processing is the slowest, use more threads (i.e. Tasks) for that. If the DB is the bottleneck, then you have to look at external means to improve its performance: DB hardware or schema, network infrastructure. If you need some results in hand before processing more data, TPL allows dependencies to be created between Tasks and the coordinating thread.
My point is that I doubt it’s worth the pain of re-engineering the entire app in native C or whatever. There are lots of ways to skin this cat.
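To make the batching idea above concrete, here is a minimal sketch of a reader that hands batches of rows to worker Tasks through a BlockingCollection. It is only an illustration under some assumptions: the names BatchPipeline, Run and Analyse are made up for this example, Analyse is a placeholder for your real string processing, the queue capacity of 8 is arbitrary, and a plain IEnumerable<string> stands in for the loop over SqlDataReader.Read() so that the sketch runs without a database.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BatchPipeline
{
    // Hypothetical placeholder for the real per-string analysis.
    public static int Analyse(string record)
    {
        return record.Length;
    }

    // Groups 'rows' (a stand-in for rows coming from a SqlDataReader) into
    // batches of 'batchSize' and processes them on 'workerCount' tasks while
    // reading continues. Returns the sum of all per-record results.
    public static long Run(IEnumerable<string> rows, int batchSize, int workerCount)
    {
        long total = 0;

        // Bounded, so the reader cannot run arbitrarily far ahead of the workers.
        using (var queue = new BlockingCollection<List<string>>(boundedCapacity: 8))
        {
            var workers = new Task[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = Task.Factory.StartNew(() =>
                {
                    // Blocks until a batch arrives; the loop ends once
                    // CompleteAdding has been called and the queue is empty.
                    foreach (List<string> batch in queue.GetConsumingEnumerable())
                    {
                        long batchSum = 0;
                        foreach (string record in batch)
                            batchSum += Analyse(record);
                        Interlocked.Add(ref total, batchSum);
                    }
                }, TaskCreationOptions.LongRunning);
            }

            try
            {
                // Producer: in the real program this would be the reader loop.
                var current = new List<string>(batchSize);
                foreach (string row in rows)
                {
                    current.Add(row);
                    if (current.Count == batchSize)
                    {
                        queue.Add(current); // blocks while the queue is full
                        current = new List<string>(batchSize);
                    }
                }
                if (current.Count > 0)
                    queue.Add(current); // final, partially filled batch
            }
            finally
            {
                // Always signal completion so the workers can finish.
                queue.CompleteAdding();
            }

            Task.WaitAll(workers);
        }
        return total;
    }
}
```

A call such as BatchPipeline.Run(rows, 1000, 4) would process batches of 1000 rows on four tasks. If string processing turns out to be the slow side, raising workerCount is the knob to turn; if the workers sit idle, the DB is the bottleneck and more threads will not help.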