I am putting together the last pattern for my flex scanner for parsing AWK

Question

Editorial Team

Asked: June 11, 2026 (2026-06-11T19:05:34+00:00)

I am putting together the last pattern for my flex scanner for parsing AWK source code.

I cannot figure out how to match the regular expressions used in the AWK source code as seen below:

{if ($0 ~ /^\/\// ){ #Match for "//" (Comment)

or more simply:

else if ($0 ~ /^Department/){

where the AWK regular expression is encapsulated within “/ /”.

All of the Flex patterns I have tried so far match my entire input file. I have tried changing the precedence of the regex pattern, with no luck. Help would be greatly appreciated!

1 Answer

Editorial Team · Answer 1 · 2026-06-11T19:05:36+00:00

Regexing regexen must be a meme somewhere. Anyway, let’s give it a try.

A gawk regex consists of:

/
any number of regex components
/

A regex component (simplified form — Note 1) is one of the following:

any character other than /, [ or \
a \ followed by any single character (we won’t get into linefeeds just now, though)
a character class (see below)

Up to here it’s easy. Now for the fun part.

A character class is:

[ or [^ or [] or [^] (Note 2)
any number of character class components
]

A character class component is (theoretically, but see Note 4 for the gawk bug) one of the following:

any single character other than ] or \ (Note 3)
a \ followed by any single character
a named character class, such as [:alpha:]
a collation class

A named character class is: (Note 5)

[:
a valid class name, which afaik is always a sequence of alpha characters, but it’s maybe safer not to make assumptions.
:]

Collation classes are mostly unimplemented but partially parsed. You could probably ignore them, because it seems like gawk doesn’t get them right yet (Note 4). But for what it’s worth, a collation class is:

[.
some multicharacter collation character, like ‘ij’ in Dutch locale (I think).
.]

or an equivalence class:

[=
some character, or maybe also a multicharacter collation character
=]

An important point is that the / in [/] does not terminate the regex. You don’t need to write [\/]. (You don’t need to do anything to implement that; I’m just mentioning it.)
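For what it's worth, here is how the grammar above might be assembled into one pattern. This is only a sketch, written in Python's re syntax rather than flex (the quoting rules differ, so it would need translating), and the names delimited, CLASS_COMPONENT, BRACKET, COMPONENT and AWK_REGEX are made up for illustration. It also assumes that a regex literal never contains a raw newline, which the description above leaves open.

```python
import re


def delimited(p: str) -> str:
    """Pattern text for an inner [p ... p] construct: [:alpha:], [.ij.], [=e=]."""
    q = re.escape(p)
    # Body: anything that is not p, or a run of p not followed by ].
    return rf"\[{q}(?:[^{q}]|{q}+[^\]{q}])*{q}+\]"


# Named, collation and equivalence classes share one shape.
INNER = "|".join(delimited(p) for p in ":.=")

# One component inside brackets (gawk rule: \ escapes the next character).
# INNER comes first so that "[:" is tried as a unit before a lone "[".
# Assumption: no raw newline inside a regex literal.
CLASS_COMPONENT = rf"(?:{INNER}|\\.|[^\]\\\n])"

# Opener is [, [^, [] or [^]; then components; then the closing ].
BRACKET = rf"\[\^?\]?{CLASS_COMPONENT}*\]"

# One component of the regex proper: plain character, escape, or bracket.
COMPONENT = rf"(?:[^/\[\\\n]|\\.|{BRACKET})"

AWK_REGEX = re.compile(rf"/{COMPONENT}*/")

for line in [
    r'{if ($0 ~ /^\/\// ){ #Match for "//" (Comment)',
    r"else if ($0 ~ /^Department/){",
    r"x ~ /[/[:alpha:]]/",
]:
    found = AWK_REGEX.search(line)
    print(found.group() if found else None)
```

This only recognises the lexeme once you know a regex starts at that /. Deciding whether a given / starts a regex or is a division operator is a separate problem that the pattern does not address.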

Note 1:

Actually, the interpretation of \ and character classes, when we get to them, is a lot more complicated. I’m just describing enough of it for lexing. If you actually want to parse the regexen into their bits and pieces, it’s a lot more irritating.

For example, you can specify an arbitrary octet with \ddd or \xHH (e.g. \203 or \x4F). However, we don’t need to care, because nothing in the escape sequence is special, so for lexing purposes it doesn’t matter; we’ll get the right end of the lexeme. Similarly, I didn’t bother describing character ranges and the peculiar rules for - inside a character class, nor do I worry about regex metacharacters (){}?*+. at all, since they don’t enter into lexing. You do have to worry about [] because it can implicitly hide a / from terminating the regex. (I once wrote a regex parser which let you hide / inside parenthesized expressions, which I thought was cool, since it cuts down a lot on the kilroy-was-here noise (\/), but nobody else seems to think this is a good idea.)
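As a quick check of the claim that the escape payload doesn't matter for lexing, here is a small sketch in Python's re syntax. Bracket expressions are left out on purpose, and NO_BRACKETS is just an illustrative name. The rule "a \ followed by any single character" is enough to find the end of each lexeme, whatever the escape actually means.

```python
import re

# Lexeme pattern with only two kinds of component:
# a plain character, or a backslash followed by any one character.
NO_BRACKETS = re.compile(r"/(?:[^/\[\\\n]|\\.)*/")

samples = [
    r"/\203/",           # octal escape: \2 is consumed, 0 and 3 are plain
    r"/\x4F/",           # hex escape: \x is consumed, 4 and F are plain
    r"/a\/b/",           # escaped slash does not end the regex
    r"/(a|b)+\.c{2}?/",  # metacharacters are just plain characters here
]

for s in samples:
    print(s, bool(NO_BRACKETS.fullmatch(s)))
```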

Note 2:

Although gawk does \ wrong inside character classes (see Note 3 below), it doesn’t require that you use backslashes there, so you can still use Posix behaviour. Posix behaviour is that the ] does not terminate the character class if it is the first character in the character class, possibly following the negating ^. The easiest way to deal with this is to let character classes start with any of the four possible sequences, which is summarized as:

\[^?]?
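To see the four openers in action, here is a small sketch using Python's re module, where the ^ has to be escaped because it would otherwise be an anchor. OPENER and SIMPLE_BRACKET are illustrative names, and SIMPLE_BRACKET deliberately ignores backslashes and nested [: :] constructs.

```python
import re

# The opener: [ then an optional ^ then an optional ].
OPENER = re.compile(r"\[\^?\]?")

# Opener, then anything that is not ], then the closing ].
SIMPLE_BRACKET = re.compile(r"\[\^?\]?[^\]\n]*\]")

for text in ["[abc]", "[^abc]", "[]abc]", "[^]abc]"]:
    opener = OPENER.match(text).group()
    whole = SIMPLE_BRACKET.match(text).group()
    print(f"{text!r}: opener {opener!r}, whole class {whole!r}")
```

Because the leading ] is swallowed by the opener, the class in []abc] ends at the second ] rather than the first.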

Note 3:

gawk differs from Posix ERE’s (Extended Regular Expressions) in that it interprets \ inside a character class as an escape character. Posix mandates that \ loses its special meaning inside character classes. I find it annoying that gawk does this (and so do many other regex libraries, equally annoying.) It’s particularly annoying that the gawk info manual says that Posix requires it to do this, when it actually requires the reverse. But that’s just me. Anyway, in gawk:

/[\]/]/

is a regular expression which matches either ] or /. In Posix, stripping the enclosing /s out of the way, it would be a regular expression which matches a \ followed by a / followed by a ]. (Both gawk and Posix require that ] not be special when it’s not being treated as a character class terminator.)
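The difference shows up in where the lexeme ends. Here is a sketch in Python's re syntax, with both rule sets simplified (no [: :] and friends) and with the illustrative names GAWK_RULE and POSIX_RULE:

```python
import re

# gawk: inside brackets, a backslash escapes the next character.
GAWK_RULE = re.compile(
    r"/(?:[^/\[\\\n]|\\.|\[\^?\]?(?:\\.|[^\]\\\n])*\])*/"
)

# Posix: inside brackets, a backslash is an ordinary character.
POSIX_RULE = re.compile(
    r"/(?:[^/\[\\\n]|\\.|\[\^?\]?[^\]\n]*\])*/"
)

source = r"/[\]/]/"
print("gawk :", GAWK_RULE.match(source).group())
print("posix:", POSIX_RULE.match(source).group())
```

Under the gawk rule the whole of /[\]/]/ is one lexeme. Under the Posix rule the class is just [\], so the lexeme stops at /[\]/ and leaves ]/ behind.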

Note 4:

There’s a bug in the version of gawk installed on my machine where the regex parser gets confused at the end of a collating class. So it thinks the regex is terminated by the second / in:

/[[.a.]/]/

although it gets this right:

/[[:alpha:]/]/

and, of course, putting the slash first always works:

/[/[:alpha:]]/

Note 5:

Character classes and collating classes and friends are a bit tricky to parse because they have two-character terminators. “Write a regex to recognize C /* */ comments” used to be a standard interview question, but I suppose it no longer is. Anyway, here’s a solution (for [:…:], but just substitute the other punctuation for : if you want to):

[[]:([^:]|:*[^]:])*:+[]]   // Yes, I know it's unreadable. Stare at it a while.
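If staring doesn't help, here is the same idea as a Python sketch that can be poked at. The helper name two_char_terminated is made up. One deliberate difference from the pattern above: the second alternative uses a one-or-more run of the punctuation instead of zero-or-more. That accepts the same strings but keeps the two alternatives disjoint, which is kinder to a backtracking engine like Python's. flex builds a DFA, so the original form is fine there.

```python
import re


def two_char_terminated(p: str) -> re.Pattern:
    """Match [p ... p] where the terminator is the two characters p and ]."""
    q = re.escape(p)
    # Body: any character that is not p, or a run of p followed by
    # something that is neither ] nor p. A run of p followed by ] is
    # therefore never swallowed by the body; it must be the terminator.
    return re.compile(rf"\[{q}(?:[^{q}]|{q}+[^\]{q}])*{q}+\]")


NAMED = two_char_terminated(":")

print(NAMED.match("[:alpha:]/]").group())  # stops at the first :]
print(bool(NAMED.fullmatch("[:a:b:]")))    # a lone : inside is fine
print(NAMED.match("[:]"))                  # opener and closer cannot share the :
print(bool(two_char_terminated(".").fullmatch("[.ij.]")))
print(bool(two_char_terminated("=").fullmatch("[=e=]")))
```

The [:] case is the analogue of /*/ not being a complete C comment.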
