I want to parse VB6 code with regular expressions. However, being new to regex, I have run into a few problems with the patterns to use. Currently, I have trouble recognizing constructs like this:
' Subs
' Sub Test
Private Sub Test(ByVal x as Integer)
'Private Sub Test(ByVal y as Integer)
Dim dummy as String
dummy = "Private Sub Test(ByVal y as Integer)"
End Sub
I basically have these two problems: How do I write a regex that matches the Sub definition and includes all the comment (and empty) lines above it? And how can I prevent Sub definitions that are commented out or contained in strings from being matched?
I also need to support definitions that span multiple lines, like this one:
' Subs
' Sub Test
Private Function Test2( _
ByVal x as Integer _
) As Long
'Private Sub Test(ByVal y as Integer)
Dim dummy as String
dummy = "Private Sub Test(ByVal y as Integer)"
End Function
Any hint would be greatly appreciated. The solutions I’ve come up with either don’t work with multiple lines or capture more than just one Sub definition: because of greedy matching, the match then runs all the way to the last End Sub occurrence.
My attempt in C# currently looks like this:
(('(?<comment>[\S \t]+[\n\r]+))*((?<accessmodifier>(Private|Public))\s+_?)(?<functiontype>(Sub|Function))\s+_?(?<name>[\S]+)\((?<parameters>[\S \t]*)\)([ \t]+As[ \t]+(?<returntype>\w+))?)|(?<endfunction>End (Sub|Function))
I’m using Multiline, Singleline, IgnoreCase, ExplicitCapture.
Thanks for your help!
Why are you parsing this code? If you’re trying to create your own compiler, you’ll need a lot more than regexes. If you’re writing an editor with syntax highlighting and type-ahead completion, regexes can do a pretty good job on the first, but not the second.
That said, the biggest problem I see with your regex is that you’re not handling line continuations properly. This:
\s+_?
matches one or more whitespace characters, optionally followed by an underscore. But if there is an underscore, it should be followed by a newline, which you aren’t matching. That’s easy enough to remedy:
\s+(_\s+)?
but I’m not sure you need to be that specific. I suspect this:
[\s_]+
will do just as well.
As for avoiding apparent sub/function declarations inside comments and strings, the simplest way would be to match them only at the left margin, with maybe some tabs or spaces for indentation. It’s cheating, I know, but it may be good enough for whatever you’re doing. I relied heavily on that trick when I was writing a Java file navigation scheme for EditPad Pro. You can’t do this kind of thing with regexes without employing lots of gimmicks and simplifying assumptions. Try this regex:
^(?>('(?<comment>.*[\n\r]+))*)
[ \t]*(?<accessmodifier>(Private|Public))
[\s_]+(?<functiontype>(Sub|Function))
[\s_]+(?<name>\S+)
[\s_]*\((?<parameters>[^()]*)\)
([\s_]+As[\s_]+(?<returntype>\w+))?
|
^[ \t]*(?<endfunction>End (Sub|Function))
Of course you’ll need to reassemble it first, joining the lines without any whitespace between them. It should be compiled with the Multiline, IgnoreCase and ExplicitCapture options, but not Singleline, so that .* in the comment group stops at the end of each line.
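To see the pieces working together, here is a minimal sketch. It is in Python rather than C# purely so that it runs standalone, so a few things are adapted, and these are my assumptions rather than part of the regex above: named groups use Python's (?P<name>...) spelling, unnamed groups are written as (?:...) to stand in for ExplicitCapture, and a single comments group replaces the atomic group with its repeated comment capture (Python keeps only the last capture of a repeated group, and older Python versions lack atomic groups). The sample source and the helper names are made up for illustration. The last part tries the three line-continuation fragments discussed earlier on a header that is split with an underscore.

```python
import re

# Python port of the pattern, reassembled by plain string concatenation.
# Assumptions of this sketch: (?P<name>...) replaces .NET's (?<name>...),
# (?:...) stands in for ExplicitCapture, and one "comments" group replaces
# the atomic group with the repeated "comment" capture.
DECLARATION = re.compile(
    r"^(?P<comments>(?:'.*[\n\r]+)*)"            # comment lines directly above
    r"[ \t]*(?P<accessmodifier>Private|Public)"  # only at the left margin
    r"[\s_]+(?P<functiontype>Sub|Function)"
    r"[\s_]+(?P<name>\S+)"
    r"[\s_]*\((?P<parameters>[^()]*)\)"
    r"(?:[\s_]+As[\s_]+(?P<returntype>\w+))?"
    r"|"
    r"^[ \t]*(?P<endfunction>End (?:Sub|Function))",
    re.MULTILINE | re.IGNORECASE,  # deliberately no re.DOTALL (Singleline)
)

# Hypothetical sample, modeled on the two snippets in the question.
SAMPLE = """\
' Subs
' Sub Test
Private Sub Test(ByVal x as Integer)
    'Private Sub Test(ByVal y as Integer)
    Dim dummy as String
    dummy = "Private Sub Test(ByVal y as Integer)"
End Sub

' Sub Test
Private Function Test2( _
    ByVal x as Integer _
) As Long
    'Private Sub Test(ByVal y as Integer)
    Dim dummy as String
    dummy = "Private Sub Test(ByVal y as Integer)"
End Function
"""


def find_declarations(source):
    """Return (functiontype, name, returntype) for each declaration found."""
    return [
        (m.group("functiontype"), m.group("name"), m.group("returntype"))
        for m in DECLARATION.finditer(source)
        if m.group("endfunction") is None  # skip the End Sub/Function matches
    ]


# The three line-continuation fragments, tried on a header split with " _".
SPLIT_HEADER = "Private _\nSub Test()"
CONTINUATION_FRAGMENTS = {
    "original": r"Private\s+_?Sub",
    "strict": r"Private\s+(?:_\s+)?Sub",
    "loose": r"Private[\s_]+Sub",
}


def continuation_results(text):
    """Map each fragment label to whether it matches somewhere in text."""
    return {
        label: re.search(pattern, text) is not None
        for label, pattern in CONTINUATION_FRAGMENTS.items()
    }


if __name__ == "__main__":
    for declaration in find_declarations(SAMPLE):
        print(declaration)
    print(continuation_results(SPLIT_HEADER))
```

The commented-out and quoted declarations are skipped only because they never start at the left margin with Private or Public, which is exactly the simplifying assumption described above. Also note that the loose [\s_]+ fragment would accept something like Private___Sub as well, which is the price of not being that specific.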