I am trying to write a regular expression to rewrite URLs to point to a proxy server.
bodystring = Regex.Replace(bodystring, "(src='/+)", "$1" + proxyStr);
The idea of this expression is pretty simple: find instances of src='/ or src='// and insert a proxy URL at that point. This works in general, but occasionally I have found cases where a literal "$1" ends up in the result string.
This makes no sense to me because if there was no match, then why would it replace anything at all?
Unfortunately I can’t give a simple example of this, as it has only happened with very large strings so far, but I’d like to know conceptually what could make this sort of thing happen.
As an aside, I tried rewriting this expression using a positive lookbehind as follows:
bodystring = Regex.Replace(bodystring, "(?<=src='/+)", proxyStr);
But this ends up with proxyStr TWICE in the output if the input string contains src='//. This also doesn’t make much sense to me, because I thought that src= would have to be present in the input twice in order for proxyStr to end up twice in the output.
When proxyStr = "10.15.15.15:8008/proxy?url=http://", the replacement string becomes "$110.15.15.15:8008/proxy?url=http://". It now contains a reference to group number 110, which certainly does not exist, so .NET treats it as literal text. You need to make sure that the text following $1 does not start with a digit. In your case you can do that by not capturing the last slash and changing the replacement string to "$1/" + proxyStr.
Edit: Rawling pointed out that .NET’s regex library addresses this issue: you can enclose the 1 in curly braces, writing ${1}, to avoid false aliasing.
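The code that originally accompanied this answer did not survive, so the following is only a sketch of what the two fixes might look like. The pattern "(src='/*)/" is one way to leave the last slash out of the capture and is an assumption, as are the class name, method names, and sample inputs. The last method touches on the lookbehind aside from the question: a zero-width lookbehind is satisfied at every position preceded by src=' plus one or more slashes, so for src='// it matches once after the first slash and again after the second.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProxyRewriteDemo
{
    // Proxy prefix from the answer; it starts with a digit, which triggers the problem.
    public const string ProxyStr = "10.15.15.15:8008/proxy?url=http://";

    // Original call: the replacement string is "$110.15...", so the group
    // reference is read as $110 rather than $1 followed by "10.15...".
    public static string Original(string body) =>
        Regex.Replace(body, "(src='/+)", "$1" + ProxyStr);

    // Fix 1 (assumed pattern): match the last slash outside the group and
    // re-emit it in the replacement, so a '/' follows $1 instead of a digit.
    public static string WithoutLastSlash(string body) =>
        Regex.Replace(body, "(src='/*)/", "$1/" + ProxyStr);

    // Fix 2: ${1} delimits the group number explicitly.
    public static string WithBraces(string body) =>
        Regex.Replace(body, "(src='/+)", "${1}" + ProxyStr);

    // Counts the zero-width positions at which the lookbehind succeeds.
    public static int LookbehindMatchCount(string body) =>
        Regex.Matches(body, "(?<=src='/+)").Count;

    public static void Main()
    {
        const string sample = "<img src='/logo.png'>";
        Console.WriteLine(Original(sample));         // contains a literal "$110.15..."
        Console.WriteLine(WithoutLastSlash(sample)); // <img src='/10.15.15.15:8008/proxy?url=http://logo.png'>
        Console.WriteLine(WithBraces(sample));       // <img src='/10.15.15.15:8008/proxy?url=http://logo.png'>
        Console.WriteLine(LookbehindMatchCount("src='//x")); // 2
    }
}
```

Of the two, ${1} is probably the more robust choice, since it keeps working no matter what proxyStr begins with and leaves the original pattern unchanged.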