I want to use the following regular expression, which was written for C#/.NET code, in Java code, but I can’t seem to convert it correctly. Can you help me out?
Regex(@"\w+:\/\/(?<Domain>[\x21-\x22\x24-\x2E\x30-\x3A\x40-\x5A\x5F\x61-\x7A]+)(?<Relative>/?\S*)", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
The most direct translation would be:
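A minimal sketch of such a direct translation follows. It is a reconstruction under the constraints this answer describes, not necessarily the exact code originally posted. The class name `DirectTranslation`, the helper `split`, and the sample URL are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DirectTranslation {
    // Same pattern as the C# original: every backslash is doubled for the Java
    // string literal, and the named groups become numbered groups
    // (group 1 = Domain, group 2 = Relative).
    // IgnoreCase -> CASE_INSENSITIVE, Singleline -> DOTALL; Compiled has no counterpart.
    static final Pattern URL = Pattern.compile(
        "\\w+:\\/\\/([\\x21-\\x22\\x24-\\x2E\\x30-\\x3A\\x40-\\x5A\\x5F\\x61-\\x7A]+)(/?\\S*)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns {domain, relative} for the first URL found, or null if none matches.
    static String[] split(String input) {
        Matcher m = URL.matcher(input);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = split("http://www.example.com/path/page.html");
        System.out.println(parts[0]); // www.example.com
        System.out.println(parts[1]); // /path/page.html
    }
}
```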
Java has no equivalent of C#’s verbatim strings, so you always have to escape backslashes. And Java’s regexes did not support named groups before Java 7, so I converted those to simple capturing groups (Java 7 and later accept the same (?<name>...) syntax).
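On Java 7 or later, the named groups can be kept as they are. The sketch below assumes such a runtime. The class name `NamedGroups` and its two helper methods are hypothetical, and the flags are omitted only to keep the example short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroups {
    // Java 7+ accepts (?<Name>...) just as .NET does; the backslashes still
    // have to be doubled because Java has no verbatim string literals.
    static final Pattern URL = Pattern.compile(
        "\\w+://(?<Domain>[\\x21-\\x22\\x24-\\x2E\\x30-\\x3A\\x40-\\x5A\\x5F\\x61-\\x7A]+)"
        + "(?<Relative>/?\\S*)");

    // Returns the Domain group of the first match, or null if nothing matches.
    static String domainOf(String input) {
        Matcher m = URL.matcher(input);
        return m.find() ? m.group("Domain") : null;
    }

    // Returns the Relative group of the first match, or null if nothing matches.
    static String relativeOf(String input) {
        Matcher m = URL.matcher(input);
        return m.find() ? m.group("Relative") : null;
    }

    public static void main(String[] args) {
        String url = "ftp://files.example.org/pub/readme.txt";
        System.out.println(domainOf(url));   // files.example.org
        System.out.println(relativeOf(url)); // /pub/readme.txt
    }
}
```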
But there are a few problems with the original regex:
- The RegexOptions.Compiled modifier doesn’t do what you probably think it does. Specifically, it’s not related to Java’s compile() method; that’s just a factory method, roughly equivalent to C#’s new Regex() constructor. The Compiled modifier causes the regex to be compiled to CIL bytecode, which can make it match a lot faster, but at a considerable cost in upfront processing and memory use, and that memory never gets garbage-collected. If you don’t use the regex a lot, the Compiled option is probably doing more harm than good, performance-wise.
- The IgnoreCase/CASE_INSENSITIVE modifier is pointless since your regex always matches both upper- and lowercase variants wherever it matches letters.
- The Singleline/DOTALL modifier is pointless since you never use the dot metacharacter.
- In .NET regexes, the character-class shorthand \w is Unicode-aware, roughly equivalent to [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. In Java it’s ASCII-only ([A-Za-z0-9_]), which seems to be more in line with the way you’re using it (you could “dumb it down” in .NET by using the RegexOptions.ECMAScript modifier).

So the actual translation would be more like this:
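A hedged sketch of that leaner translation, applying the points above: no flags, and no escaping of the forward slashes, which need none. It is an illustration, not necessarily the code originally posted. The class name `SimplifiedTranslation`, the helper `split`, and the sample input are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimplifiedTranslation {
    // No CASE_INSENSITIVE: \w and the character class already cover both cases.
    // No DOTALL: the pattern contains no dot metacharacter.
    // Group 1 = Domain, group 2 = Relative.
    static final Pattern URL = Pattern.compile(
        "\\w+://([\\x21-\\x22\\x24-\\x2E\\x30-\\x3A\\x40-\\x5A\\x5F\\x61-\\x7A]+)(/?\\S*)");

    // Returns {domain, relative} for the first URL found, or null if none matches.
    static String[] split(String input) {
        Matcher m = URL.matcher(input);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Uppercase input still matches without the case-insensitive flag.
        String[] parts = split("HTTP://WWW.EXAMPLE.COM/Index.HTML");
        System.out.println(parts[0]); // WWW.EXAMPLE.COM
        System.out.println(parts[1]); // /Index.HTML
    }
}
```

Because Pattern.compile() is only a factory method, storing the result in a static final field is enough to avoid re-parsing the pattern on every call.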