I’m building a basic lexer in PHP, just as an exercise. Right now I’m making it lex PHP source and output highlighted source via HTML tags, but I’m using real token names and stuff, not just a few broad regex matches.
The way I’m setting it up is to read in the PHP source, character by character. It checks the current character to figure out what the current token might be and then reads in the next x characters that match the appropriate pattern.
For example, if the current character is a ", I’ll read in all characters until I encounter another " that wasn’t preceded by an escaping \. Is this a bad way of doing it? The only other way I’ve seen and understood involved making a class that compiled a massive regex and matched all tokens at once, but that doesn’t seem as flexible to me.
Thoughts?
$str = '';
// Normalize Windows line endings, then split the source into single bytes
$php = str_replace( "\r\n", "\n", $php );
$php = str_split( $php );
$len = count( $php );
$keyword = '';
for ( $i = 0; $i < $len; $i++ ) {
    $char = $php[$i];
    // Detect PHP strings and backtick execution operators
    if ( strpos( self::STRING_CHARACTERS, $char ) !== FALSE ) {
        $string = $char;
        $opening_quote = $char;
        $escaped = FALSE;
        // Consume everything up to the matching, unescaped closing quote
        while ( isset( $php[++$i] ) && ( $escaped || $php[$i] != $opening_quote ) ) {
            $string .= $php[$i];
            // A backslash escapes only the character right after it, so the
            // flag is cleared again once that character has been consumed
            $escaped = !$escaped && $php[$i] == '\\';
        }
        // Append the closing quote, unless the input ended inside the string
        if ( isset( $php[$i] ) ) {
            $string .= $php[$i];
        }
        if ( $opening_quote == "'" ) {
            $str .= '<span class="php-string php-single-quoted-string">' . htmlspecialchars( $string ) . '</span>';
        } else if ( $opening_quote == '"' ) {
            $str .= '<span class="php-string php-double-quoted-string">' . htmlspecialchars( $string ) . '</span>';
        } else if ( $opening_quote == '`' ) {
            $str .= '<span class="php-execution-operator php-backtick">' . htmlspecialchars( $string ) . '</span>';
        }
        continue;
    }
    $str .= $char;
}
If you’re intending to keep it a hand-written tool, then definitely keep going with your current approach.
The giant matching engine approach is fantastic if you’re writing a tool such as flex or ANTLR, and you want to be able to build highly efficient parsers all day long for a variety of languages. But it is a fair amount of extra effort if you’re interested in parsing only one language.
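For comparison, here is a rough sketch of what the single-regex approach could look like in PHP. It is only an illustration, not how flex or ANTLR work internally: the token table, the names buildLexerRegex and lex, and the patterns themselves are all made up for this example and cover only a tiny slice of PHP. Each token becomes a named group in one big alternation, and the lexer repeatedly matches that regex at the current offset.

```php
<?php
// Hypothetical token table: name => PCRE fragment (no delimiters).
// Order matters, because earlier alternatives win.
$tokens = [
    'string'     => '"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'',
    'variable'   => '\$[A-Za-z_]\w*',
    'number'     => '\d+',
    'whitespace' => '\s+',
    'other'      => '.', // catch-all so the lexer always advances
];

// Combine all fragments into one anchored regex with named groups.
function buildLexerRegex(array $tokens): string {
    $parts = [];
    foreach ($tokens as $name => $pattern) {
        $parts[] = "(?P<$name>$pattern)";
    }
    // A = match only at the given offset, s = let "." match newlines too
    return '~' . implode('|', $parts) . '~As';
}

// Return a list of [tokenName, text] pairs for the whole input.
function lex(string $source, array $tokens): array {
    $regex  = buildLexerRegex($tokens);
    $out    = [];
    $offset = 0;
    $len    = strlen($source);
    while ($offset < $len) {
        if (!preg_match($regex, $source, $m, 0, $offset)) {
            break; // should not happen while the catch-all is present
        }
        // The first non-empty named group tells us which token matched.
        foreach ($tokens as $name => $_) {
            if (isset($m[$name]) && $m[$name] !== '') {
                $out[] = [$name, $m[$name]];
                break;
            }
        }
        $offset += strlen($m[0]);
    }
    return $out;
}
```

Calling lex( $php, $tokens ) gives you a flat list of name/text pairs that you could wrap in span tags the same way your loop does. The catch is that every special case (heredocs, nested interpolation, and so on) has to be squeezed into a pattern, which is where your hand-written loop stays easier to extend.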