Basically what I would like to do is run multiple (15-25) regex replaces on a single string with the best possible memory management.
Overview:
The program streams a text-only file (sometimes HTML) via FTP, appending to a StringBuilder to build one very large string. The file size ranges from 300KB to 30MB.
The regular expressions are semi-complex and need to span multiple lines of the file (identifying sections of a book, for example), so arbitrarily breaking the string, or running the replaces on every download loop, is out of the question.
A sample replace:
Regex re = new Regex("<A.*?>Table of Contents</A>", RegexOptions.IgnoreCase);
source = re.Replace(source, "");
With each run of a replace the memory skyrockets. I know this is because strings are immutable in C# and each replace has to make a copy, but even calling GC.Collect() doesn't help enough for a 30MB file.
Any advice on a better approach, or on a way to perform multiple regex replaces using constant memory (make 2 copies, so 60MB in memory, perform the search, then discard one copy and drop back to 30MB)?
Update:
There does not appear to be a simple answer, but for future readers: I ended up using a combination of all the answers below to get it to an acceptable state:
- If possible, split the string into chunks. See manojlds's answer for a way to do that as the file is being read, by looking for suitable end points.
- If you can't split as it streams, at least split it later if possible. See ChrisWue's answer for some external tools that may help with this by piping to files.
- Optimize the regex: avoid greedy operators and try to limit what the engine has to do as much as possible. See Sylverdrag's answer.
- Combine the regexes when possible. This cuts down the number of replaces when the regexes do not depend on each other (useful in this case for cleaning bad input). See Brian Reichle's answer for a code sample.
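For the first point, a rough sketch of what chunking while reading might look like. This is not taken from any of the answers: `ChunkedCleaner`, `isBoundary` and `clean` are made-up names, and it assumes a "suitable end point" can be recognised from a single line (a chapter heading, say). Note that it normalises line endings to `\n`.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkedCleaner
{
    // isBoundary: hypothetical test for a line that starts a new section.
    // clean: whatever regex replaces you need, applied to one chunk at a time.
    public static void Process(TextReader input, TextWriter output,
                               Func<string, bool> isBoundary,
                               Func<string, string> clean)
    {
        var chunk = new StringBuilder();
        string line;
        while ((line = input.ReadLine()) != null)
        {
            // A boundary line starts a new chunk, so flush what has accumulated first.
            if (isBoundary(line) && chunk.Length > 0)
            {
                output.Write(clean(chunk.ToString()));
                chunk.Clear();
            }
            chunk.Append(line).Append('\n');
        }

        // Flush the final chunk.
        if (chunk.Length > 0)
        {
            output.Write(clean(chunk.ToString()));
        }
    }
}
```

Only one chunk (plus its cleaned copy) is held as a string at any time, so peak memory depends on the largest section rather than on the whole file. It only works if no pattern needs to match across a boundary.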
Thank you all!
Depending on the nature of the regexes, you might be able to combine them into a single regular expression and use the overload of Replace() that takes a MatchEvaluator delegate to determine the replacement from the matched string.
Of course, this falls apart if later patterns need to match against the results of earlier replacements.
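A minimal sketch of the idea, assuming two independent patterns. The `toc` pattern is a slightly tightened variant of the one in the question (`[^>]*` instead of `.*?`), and the `&nbsp;` pattern and the names `CombinedReplace` and `Clean` are invented purely for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedReplace
{
    // Each alternative sits in its own named group so the evaluator
    // can tell which pattern produced the match.
    private static readonly Regex Combined = new Regex(
        @"(?<toc><A[^>]*>Table of Contents</A>)|(?<nbsp>&nbsp;)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Clean(string source)
    {
        // One pass over the input, so one new string instead of one per pattern.
        return Combined.Replace(source, m =>
        {
            if (m.Groups["toc"].Success) return string.Empty; // drop the link
            if (m.Groups["nbsp"].Success) return " ";         // plain space
            return m.Value;                                   // leave anything else alone
        });
    }
}
```

Usage would be something like `source = CombinedReplace.Clean(source);`. With 15-25 patterns the alternation gets long, so it is probably worth grouping only the patterns that really are independent of each other.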