I’m building a basic lexer in PHP, just as an exercise. Right now I’m


Asked: June 5, 2026 (2026-06-05T09:31:40+00:00)


I’m building a basic lexer in PHP, just as an exercise. Right now I’m making it lex PHP source and output highlighted source via HTML tags, but I’m using real token names and stuff, not just a few broad regex matches.

The way I’m setting it up is to read in the PHP source, character by character. It checks the current character to figure out what the current token might be and then reads in the next x characters that match the appropriate pattern.

For example, if the current character is a ", I’ll read in all characters until I encounter another " that wasn’t preceded by an escaping \. Is this a bad way of doing it? The only other way I’ve seen and understood involved writing a class that compiled one massive regex and matched all tokens at once, but that doesn’t seem as flexible to me.

Thoughts?

    $str = '';

    $php = str_replace( "\r\n", "\n", $php );
    $php = str_split( $php );
    $len = count( $php );
    $keyword = '';

    for ( $i = 0; $i < $len; $i++ ) {
        $char = $php[$i];

        // Detect PHP strings and backtick execution operators
        if ( strpos( self::STRING_CHARACTERS, $char ) !== FALSE ) {
            $string         = $char;
            $opening_quote  = $char;
            $escaped        = FALSE;

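            // Consume characters until the matching unescaped quote (or end of input).
            while ( isset( $php[++$i] ) && ( $escaped || $php[$i] != $opening_quote ) ) {
                $string .= $php[$i];

                // A backslash escapes only the next character: clear the flag
                // after an escaped character, set it on an unescaped backslash.
                $escaped = !$escaped && $php[$i] == '\\';
            }

            // Append the closing quote, unless the input ended mid-string.
            if ( isset( $php[$i] ) ) {
                $string .= $php[$i];
            }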

            if ( $opening_quote == "'" ) {
                $str .= '<span class="php-string php-single-quoted-string">' . htmlspecialchars( $string ) . '</span>';
            } else if ( $opening_quote == '"' ) {
                $str .= '<span class="php-string php-double-quoted-string">' . htmlspecialchars( $string ) . '</span>';
            } else if ( $opening_quote == '`' ) {
                $str .= '<span class="php-execution-operator php-backtick">' . htmlspecialchars( $string ) . '</span>';
            }
            continue;
        }

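        // Escape everything else so characters such as < and & survive in the HTML output.
        $str .= htmlspecialchars( $char );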
    }


1 Answer

Editorial Team · Answer 1 · 2026-06-05T09:31:42+00:00


If you’re intending to keep it a hand-written tool, then definitely keep going with your current approach.
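As the lexer grows, one way to keep the hand-written style manageable is to move each "read until the token ends" loop into its own small function and let the main loop only dispatch on the current character. The following is a minimal sketch of that idea, not a full PHP lexer: the function names (`read_quoted`, `read_while`, `lex`) and the four token types are made up for illustration, it works on the raw string instead of `str_split` output, and it assumes PHP 7.4+ for the arrow function.

```php
<?php
// Hypothetical helpers; a sketch of the dispatch-on-first-character style.

/** Read a quoted string starting at $pos; returns the lexeme including its quotes. */
function read_quoted(string $src, int $pos): string
{
    $quote = $src[$pos];
    $len   = strlen($src);
    $i     = $pos + 1;

    while ($i < $len && $src[$i] !== $quote) {
        // A backslash escapes exactly one following character, so skip both.
        $i += ($src[$i] === '\\') ? 2 : 1;
    }

    // Include the closing quote when present; clamp if the input ended early.
    return substr($src, $pos, min($i + 1, $len) - $pos);
}

/** Read the longest run of characters accepted by $accept, starting at $pos. */
function read_while(string $src, int $pos, callable $accept): string
{
    $i   = $pos;
    $len = strlen($src);
    while ($i < $len && $accept($src[$i])) {
        $i++;
    }
    return substr($src, $pos, $i - $pos);
}

/** Split $src into [type, text] pairs by inspecting the current character. */
function lex(string $src): array
{
    $tokens = [];
    $pos    = 0;
    $len    = strlen($src);

    while ($pos < $len) {
        $char = $src[$pos];

        if ($char === '"' || $char === "'" || $char === '`') {
            $text = read_quoted($src, $pos);
            $type = 'string';
        } elseif (ctype_space($char)) {
            $text = read_while($src, $pos, 'ctype_space');
            $type = 'whitespace';
        } elseif (ctype_alpha($char) || $char === '_') {
            $text = read_while($src, $pos, fn ($c) => ctype_alnum($c) || $c === '_');
            $type = 'word';
        } else {
            // Anything not recognised yet becomes a one-character token.
            $text = $char;
            $type = 'other';
        }

        $tokens[] = [$type, $text];
        $pos += strlen($text);
    }

    return $tokens;
}
```

Adding a new token kind then means adding one branch and, if needed, one reader function, which is where the flexibility of this approach comes from. The highlighter can wrap each `[type, text]` pair in a `<span>` afterwards, so lexing and HTML generation stay separate.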

The giant matching engine approach is fantastic if you’re writing a tool such as flex or ANTLR, and you want to be able to build highly efficient parsers all day long for a variety of languages. But it is a fair amount of extra effort if you’re interested in parsing only one language.
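For comparison, the "one big regex" style can be sketched in a few lines of PHP. This is only an illustration under assumptions: the token table, the group names, and the function names (`build_regex`, `lex_with_regex`) are hypothetical, the patterns cover just a handful of token kinds, and real generators such as flex compile their rules into a state machine instead of calling a regex engine in a loop.

```php
<?php
// Hypothetical token table: name => PCRE fragment (no delimiters, no "~").
// Order matters: earlier entries win when several could match.
$patterns = [
    'string'     => '"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'',
    'whitespace' => '\s+',
    'word'       => '[A-Za-z_][A-Za-z0-9_]*',
    'other'      => '.',
];

/** Combine all fragments into one alternation of named groups. */
function build_regex(array $patterns): string
{
    $parts = [];
    foreach ($patterns as $name => $pattern) {
        $parts[] = "(?P<$name>$pattern)";
    }
    // A = anchor the match at the given offset, s = let "." match newlines.
    return '~' . implode('|', $parts) . '~As';
}

/** Split $src into [type, text] pairs using the combined regex. */
function lex_with_regex(string $src, array $patterns): array
{
    $regex  = build_regex($patterns);
    $tokens = [];
    $pos    = 0;
    $len    = strlen($src);

    while ($pos < $len && preg_match($regex, $src, $m, 0, $pos)) {
        // Exactly one named group is non-empty: the alternative that matched.
        foreach ($patterns as $name => $_) {
            if (isset($m[$name]) && $m[$name] !== '') {
                $tokens[] = [$name, $m[$name]];
                $pos += strlen($m[$name]);
                break;
            }
        }
    }

    return $tokens;
}
```

The trade-off is visible even at this size: the token rules become data that is easy to extend or swap for another language, but context-dependent behaviour (for example, treating text differently inside a double-quoted string) is harder to express than in a hand-written loop.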
