I’m having a very strange problem where my execution jumps from semi-predictable locations to another location while debugging a Visual Studio .NET unit test. The method in which this strange behavior occurs is “Parse(…)”, below. I’ve indicated in this method the one location where execution will jump to (“// EXCEPTION”). I’ve also indicated several of the places where, in my testing, execution was when it strangely jumped (“// JUMP”). The jump will frequently occur from the same place several times consecutively, and then begin jumping from a new location consecutively. These places from which execution jumps are either the beginning of switch statements or the ends of code blocks, suggesting to me that there is something funky going on with the instruction pointer, but I’m not .NET savvy enough to know what that something might be. If it makes any difference, the execution does not jump to immediately before the “throw” statement, but instead to a point in execution where the exception has just been thrown. Very strange.
In my experience the execution jump only occurs while parsing the contents of a nested named group.
Background on what the code below purports to do: the solution I’m trying to implement is a simple regular expression parser. This is not a full-out regex parser. My needs are only to be able to locate particular named groups inside a regular expression and replace some of the named groups’ contents with other contents. So basically I’m just running through a regular expression and keeping track of named groups I find. I also keep track of unnamed groups, since I need to be aware of parenthesis matching, and comments, so that commented parentheses don’t upset paren-matching. A separate (and as-of-yet unimplemented) piece of code will reconstruct a string containing the regex after taking into account the replacements.
I greatly appreciate any suggestions of what might be afoot; I’m baffled!
Example Solution
Here is a Visual Studio 2010 Solution (TAR format) containing all the code I discuss below. I have the error when running this solution (with the unit test project “TestRegexParserLibTest” as the Startup Project.) Since this seems to be such a sporadic error, I’d be interested if anyone else experiences the same problem.
Code
I use some simple classes to organize the results:
// The root of the regex we are parsing
public class RegexGroupStructureRoot : ISuperRegexGroupStructure
{
public List<RegexGroupStructure> SubStructures { get; set; }
public RegexGroupStructureRoot()
{
SubStructures = new List<RegexGroupStructure>();
}
public override bool Equals(object obj) { ... }
}
// Either a RegexGroupStructureGroup or a RegexGroupStructureRegex
// Contained within the SubStructures of both RegexGroupStructureRoot and RegexGroupStructureGroup
public abstract class RegexGroupStructure
{
}
// A run of text containing regular expression characters (but not groups)
public class RegexGroupStructureRegex : RegexGroupStructure
{
public string Regex { get; set; }
public override bool Equals(object obj) { ... }
}
// A regular expression group
public class RegexGroupStructureGroup : RegexGroupStructure, ISuperRegexGroupStructure
{
// Name == null indicates an unnamed group
public string Name { get; set; }
public List<RegexGroupStructure> SubStructures { get; set; }
public RegexGroupStructureGroup()
{
SubStructures = new List<RegexGroupStructure>();
}
public override bool Equals(object obj) { ... }
}
// Items that contain SubStructures
// Either a RegexGroupStructureGroup or a RegexGroupStructureRoot
interface ISuperRegexGroupStructure
{
List<RegexGroupStructure> SubStructures { get; }
}
Here’s the method (and associated enum/static members) where I actually parse the regular expression, returning a RegexGroupStructureRoot that contains all the named groups, unnamed groups, and other regular expression characters that were found.
using Re = System.Text.RegularExpressions;
enum Mode
{
TopLevel, // Not in any group
BeginGroup, // Just encountered a character beginning a group: "("
BeginGroupTypeControl, // Just encountered a character controlling group type, immediately after beginning a group: "?"
NamedGroupName, // Reading the named group name (must have encountered a character indicating a named group type immediately following a group type control character: "<" after "?")
NamedGroup, // Reading the contents of a named group
UnnamedGroup, // Reading the contents of an unnamed group
}
static string _NamedGroupNameValidCharRePattern = "[A-Za-z0-9_]";
static Re.Regex _NamedGroupNameValidCharRe;
static RegexGroupStructureParser()
{
_NamedGroupNameValidCharRe = new Re.Regex(_NamedGroupNameValidCharRePattern);
}
public static RegexGroupStructureRoot Parse(string regex)
{
string newLine = Environment.NewLine;
int newLineLen = newLine.Length;
// A record of the parent structures that the parser has created
Stack<ISuperRegexGroupStructure> parentStructures = new Stack<ISuperRegexGroupStructure>();
// The current text we've encountered
StringBuilder textConsumer = new StringBuilder();
// Whether the parser is in an escape sequence
bool escaped = false;
// Whether the parser is in an end-of-line comment (such comments run from a hash-sign ('#') to the end of the line).
// The other type of .NET regular expression comment is the group-comment: (?#This is a comment)
// We do not need to specially handle this type of comment since it is treated like an unnamed
// group.
bool commented = false;
// The current mode of the parsing process
Mode mode = Mode.TopLevel;
// Push a root onto the parents to accept whatever regexes/groups we encounter
parentStructures.Push(new RegexGroupStructureRoot());
foreach (char chr in regex.ToArray())
{
if (escaped) // JUMP
{
textConsumer.Append(chr);
escaped = false;
}
else if (chr.Equals('#'))
{
textConsumer.Append(chr);
commented = true;
}
else if (commented)
{
textConsumer.Append(chr);
string txt = textConsumer.ToString();
int txtLen = txt.Length;
if (txtLen >= newLineLen &&
// Does the current text end with a NewLine?
txt.Substring(txtLen - newLineLen, newLineLen) == newLine)
{
// If so we're no longer in the comment
commented = false;
}
}
else
{
switch (mode) // JUMP
{
case Mode.TopLevel:
switch (chr)
{
case '\\':
textConsumer.Append(chr); // Append the backslash
escaped = true;
break;
case '(':
beginNewGroup(parentStructures, ref textConsumer, ref mode);
break;
case ')':
// Can't close a group if we're already at the top-level
throw new InvalidRegexFormatException("Too many ')'s.");
default:
textConsumer.Append(chr);
break;
}
break;
case Mode.BeginGroup:
switch (chr)
{
case '?':
// If it's an unnamed group, we'll re-add the question mark.
// If it's a named group, named groups reconstruct question marks so no need to add it.
mode = Mode.BeginGroupTypeControl;
break;
default:
// Only a '?' can begin a named group. So anything else begins an unnamed group.
parentStructures.Peek().SubStructures.Add(new RegexGroupStructureRegex()
{
Regex = textConsumer.ToString()
});
textConsumer = new StringBuilder();
parentStructures.Push(new RegexGroupStructureGroup()
{
Name = null, // null indicates an unnamed group
SubStructures = new List<RegexGroupStructure>()
});
mode = Mode.UnnamedGroup;
break;
}
break;
case Mode.BeginGroupTypeControl:
switch (chr)
{
case '<':
mode = Mode.NamedGroupName;
break;
default:
// We previously read a question mark to get here, but the group turned out not to be a named group
// So add back in the question mark, since unnamed groups don't reconstruct with question marks
// Append the two chars separately; '?' + chr would be integer addition in C#
textConsumer.Append('?').Append(chr);
mode = Mode.UnnamedGroup;
break;
}
break;
case Mode.NamedGroupName:
if (chr.Equals( '>'))
{
// '>' closes the named group name. So extract the name
string namedGroupName = textConsumer.ToString();
if (namedGroupName == String.Empty)
throw new InvalidRegexFormatException("Named group names cannot be empty.");
// Create the new named group
RegexGroupStructureGroup newNamedGroup = new RegexGroupStructureGroup() {
Name = namedGroupName,
SubStructures = new List<RegexGroupStructure>()
};
// Add this group to the current parent
parentStructures.Peek().SubStructures.Add(newNamedGroup);
// ...and make it the new parent.
parentStructures.Push(newNamedGroup);
textConsumer = new StringBuilder();
mode = Mode.NamedGroup;
}
else if (_NamedGroupNameValidCharRe.IsMatch(chr.ToString()))
{
// Append any valid named group name char to the growing named group name
textConsumer.Append(chr);
}
else
{
// chr is neither a valid named group name character, nor the character that closes the named group name (">"). Error.
throw new InvalidRegexFormatException(String.Format("Invalid named group name character: {0}", chr)); // EXCEPTION
}
break; // JUMP
case Mode.NamedGroup:
case Mode.UnnamedGroup:
switch (chr) // JUMP
{
case '\\':
textConsumer.Append(chr);
escaped = true;
break;
case ')':
closeGroup(parentStructures, ref textConsumer, ref mode);
break;
case '(':
beginNewGroup(parentStructures, ref textConsumer, ref mode);
break;
default:
textConsumer.Append(chr);
break;
}
break;
default:
throw new Exception("Exhausted Modes");
}
} // JUMP
}
ISuperRegexGroupStructure finalParent = parentStructures.Pop();
Debug.Assert(parentStructures.Count < 1, "Left parent structures on the stack.");
Debug.Assert(finalParent.GetType().Equals(typeof(RegexGroupStructureRoot)), "The final parent must be a RegexGroupStructureRoot");
string finalRegex = textConsumer.ToString();
if (!String.IsNullOrEmpty(finalRegex))
finalParent.SubStructures.Add(new RegexGroupStructureRegex() {
Regex = finalRegex
});
return finalParent as RegexGroupStructureRoot;
}
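The end-of-line comment handling in Parse comes down to asking whether the text gathered so far ends with Environment.NewLine. A minimal sketch of that check in isolation, assuming String.EndsWith with an ordinal comparison is acceptable here; the class and method names are hypothetical and not part of the solution:

```csharp
using System;
using System.Text;

public static class CommentEndSketch
{
    // Hypothetical helper: returns true when the accumulated text
    // ends with the platform's newline sequence.
    public static bool EndsWithNewLine(StringBuilder textConsumer)
    {
        // Ordinal comparison keeps the check culture-independent.
        return textConsumer.ToString()
            .EndsWith(Environment.NewLine, StringComparison.Ordinal);
    }
}
```

Note that a verbatim string literal takes its line endings from the source file, so if the regex text could contain a bare '\n', testing for that character instead may be the safer choice.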
And here is a unit test that will test if the method works (note, may not be 100% correct since I don’t even get past the call to RegexGroupStructureParser.Parse.)
[TestMethod]
public void ParseTest_Short()
{
string regex = @"
(?<Group1>
,?\s+
(?<Group1_SubGroup>
[\d–-]+ # One or more digits, hyphen, and/or n-dash
)
)
";
RegexGroupStructureRoot expected = new RegexGroupStructureRoot()
{
SubStructures = new List<RegexGroupStructure>()
{
new RegexGroupStructureGroup() {
Name = "Group1",
SubStructures = new List<RegexGroupStructure> {
new RegexGroupStructureRegex() {
Regex = @"
,?\s+
"
},
new RegexGroupStructureGroup() {
Name = "Group1_SubGroup",
SubStructures = new List<RegexGroupStructure>() {
new RegexGroupStructureRegex() {
Regex = @"
[\d–-]+ # One or more digits, hyphen, and/or n-dash
"
}
}
},
new RegexGroupStructureRegex() {
Regex = @"
"
}
}
},
new RegexGroupStructureRegex() {
Regex = @"
"
},
}
};
RegexGroupStructureRoot actual = RegexGroupStructureParser.Parse(regex);
Assert.AreEqual(expected, actual);
}
Your solution’s test case does cause the thrown “Invalid named group name character” exception to halt at the break; rather than the throw line. I rigged up a test file using a nested if in a case to see if the exception triggers similarly in one of my projects and it did not: the halted line was the throw statement itself.
However, when I enable editing (to use edit and continue in your project), the current line rewinds back to the throw statement. I haven’t looked at the generated IL, but I suspect that the throw (which terminates the case without needing a “break” to follow it) is being optimized in a way that is confusing the display, but not the actual execution or even the edit and continue feature. If edit and continue works and the thrown exceptions are properly caught or displayed, I suspect you have a display anomaly that you can ignore (although I would report it to Microsoft along with that file, as it is reproducible).
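To illustrate the point about throw and break, here is a minimal sketch with hypothetical names (Describe, ThrowInCaseSketch) that are not taken from the project. One section ends directly in a throw and needs no break; the other throws inside an if/else, so the trailing break exists only for the branch that does not throw:

```csharp
using System;

public static class ThrowInCaseSketch
{
    // Hypothetical stand-in for the name-character handling in Parse.
    public static string Describe(char chr, string nameSoFar)
    {
        string result;
        switch (chr)
        {
            case '>':
                if (nameSoFar.Length == 0)
                {
                    throw new FormatException("Named group names cannot be empty.");
                }
                else
                {
                    result = "name complete";
                }
                break; // reached only by the non-throwing branch
            case ' ':
                // A section may end in throw; no break is needed after it.
                throw new FormatException(
                    String.Format("Invalid named group name character: {0}", chr));
            default:
                result = "name character";
                break;
        }
        return result;
    }
}
```

In both throwing paths the exception propagates from the throw statement itself; which source line the debugger highlights afterwards is a separate, display-level question.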