I learnt that C gets translated to assembly and then assembly gets translated to machine code. And I learnt how to translate basic C constructs such as pointers and loops to 32-bit MIPS assembly. But I didn’t learn how to translate regexes (in C, for instance) to assembly. Is there a recipe?
Translating regular expressions to assembly language seems to have gone out of style a couple of decades ago. Instead, these days they’re usually compiled to deterministic finite automata (DFA), often with an intermediate step as a non-deterministic finite automaton (NFA). If you’re unfamiliar with these terms, see:
The NFA corresponding to a regex is pretty easily constructed; just consider each point in the regex as a state, and the set of characters that can match and move you to the next point in the regex as the transitions from that state to the next state.
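As a rough illustration of that construction, here is a minimal Python sketch. It assumes a toy regex subset of my own choosing (literal characters, `.`, and the postfix operators `*` and `?`; no groups, alternation, or escapes), and the names `parse`, `closure`, `step` and `nfa_match` are made up for this example. State `i` means "about to match token `i`", and the matcher tracks the set of states it could be in, so it never backtracks.

```python
def parse(pattern):
    """Split a pattern into (char, quantifier) tokens; quantifier is '', '*' or '?'."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        quant = ''
        if i + 1 < len(pattern) and pattern[i + 1] in '*?':
            quant = pattern[i + 1]
            i += 1
        tokens.append((ch, quant))
        i += 1
    return tokens


def closure(states, tokens):
    """Add every state reachable without consuming input (skipping optional tokens)."""
    result = set()
    stack = list(states)
    while stack:
        s = stack.pop()
        if s in result:
            continue
        result.add(s)
        # A token followed by '*' or '?' may match nothing, so state s+1 is also live.
        if s < len(tokens) and tokens[s][1] in ('*', '?'):
            stack.append(s + 1)
    return result


def step(states, c, tokens):
    """Follow every transition labelled with character c from the current state set."""
    nxt = set()
    for s in states:
        if s == len(tokens):  # accepting state has no outgoing transitions
            continue
        ch, quant = tokens[s]
        if ch == '.' or ch == c:
            # '*' loops back to the same state; everything else advances.
            nxt.add(s if quant == '*' else s + 1)
    return closure(nxt, tokens)


def nfa_match(pattern, text):
    """Return True if the whole of text matches pattern (anchored at both ends)."""
    tokens = parse(pattern)
    states = closure({0}, tokens)
    for c in text:
        states = step(states, c, tokens)
    return len(tokens) in states
```

For example, `nfa_match("ab*c", "abbbc")` returns `True`. Each input character is examined once against at most one state per token, so the running time is bounded by the pattern length times the text length. Turning the state sets themselves into DFA states (the subset construction) is the step that gets you a plain table-driven loop.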
Other popular regex engines, including PCRE, don’t build an automaton at all but use a backtracking matcher (PCRE does compile the pattern, but only to an internal form that the backtracking matcher walks). Such a matcher is simple to write, but its worst-case memory usage is pathologically bad (many recursive call frames, leading to stack overflow, if implemented as actual function calls), and so is its worst-case big-O performance, which can be exponential time.
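To make the exponential case concrete, here is a minimal sketch of a backtracking matcher in Python. It is not how PCRE is implemented; it only assumes a toy regex subset (literals, `.`, postfix `*` and `?`), and `backtrack_match` is a hypothetical name for this example. It returns the match result together with the number of recursive calls made, so you can watch the work grow.

```python
def backtrack_match(pattern, text):
    """Match the whole of text against pattern by backtracking.

    Returns (matched, calls), where calls counts recursive invocations.
    """
    # Tokenise into (char, quantifier) pairs; quantifier is '', '*' or '?'.
    tokens = []
    i = 0
    while i < len(pattern):
        has_quant = i + 1 < len(pattern) and pattern[i + 1] in '*?'
        quant = pattern[i + 1] if has_quant else ''
        tokens.append((pattern[i], quant))
        i += 2 if quant else 1

    calls = 0

    def match_here(t, s):
        """Try to match tokens[t:] against text[s:]."""
        nonlocal calls
        calls += 1
        if t == len(tokens):
            return s == len(text)
        ch, quant = tokens[t]
        head_ok = s < len(text) and (ch == '.' or ch == text[s])
        if quant == '?':
            # Greedy: try consuming one character first, then try skipping.
            return (head_ok and match_here(t + 1, s + 1)) or match_here(t + 1, s)
        if quant == '*':
            # Greedy: try consuming and staying on this token, then moving on.
            return (head_ok and match_here(t, s + 1)) or match_here(t + 1, s)
        return head_ok and match_here(t + 1, s + 1)

    return match_here(0, 0), calls
```

A classic bad input is the pattern `a?` repeated n times followed by `a` repeated n times, matched against n copies of `a`. The only successful path skips every `a?`, and the greedy search tries that path last, so the call count exceeds 2^n. Try `backtrack_match("a?" * n + "a" * n, "a" * n)` with n = 10, 15, 20 and compare the counts.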