I’m using a regular expression to replace commas that are not contained by text qualifying quotes into tab spaces.
I’m running the regex on file content through a script task in SSIS. The file content is over 6000 lines long.
I saw an example of using a regex on file content that looked like this
String FileContent = ReadFile(FilePath, ErrInfo);
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)");
FileContent = r.Replace(FileContent, "\t");
That replace can understandably take its sweet time on a decent sized file.
Is there a more efficient way to run this regex?
Would it be faster to read the file line by line and run the regex per line?
The problem is the lookahead, which looks all the way to the end on each comman, resulting in O(n2) complexity, which is noticeable on long inputs. You can get it done in a single pass by skipping over quotes while replacing:
Here we match free command and quoted strings. The
Replacemethod takes a callback with a condition that checks if we found a comma or not, and replaced accordingly.