As an educational exercise, I’ve set out to write a Python lexer in Python.
Eventually, I’d like to implement a simple subset of Python that can run itself, so I want this lexer to be written in a reasonably simple subset of Python with as few imports as possible.
The tutorials I have found involving lexing, for instance Kaleidoscope, look ahead a single character to determine what token should come next, but I am afraid this is insufficient for Python. For one thing, by looking at just one character you can’t differentiate between a delimiter and an operator, or between an identifier and a keyword. Furthermore, handling indentation looks like a whole new beast to me, among other things.
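To make the concern concrete, here is a minimal sketch of the usual way around the first two cases: scan the whole word and only then check it against a keyword table, and for symbols take the longest entry that matches at the current position. The tables, the token names, and the function name tokenize_line are all made up for illustration and cover only a handful of Python's keywords and operators; it handles a single line with no strings, comments, or indentation.

```python
# Hypothetical, deliberately incomplete tables (assumptions for this sketch).
KEYWORDS = ["if", "else", "while", "def", "return"]
OPERATORS = ["**=", "//=", "==", "!=", "<=", ">=", "**", "//",
             "+", "-", "*", "/", "<", ">", "="]
DELIMITERS = ["(", ")", "[", "]", ":", ","]


def longest_match(text, i):
    # Try every known symbol at position i and keep the longest one,
    # so "**=" wins over "**" and "*".
    best_kind = None
    best_sym = ""
    for kind, table in (("OP", OPERATORS), ("DELIM", DELIMITERS)):
        for sym in table:
            if text.startswith(sym, i) and len(sym) > len(best_sym):
                best_kind = kind
                best_sym = sym
    return best_kind, best_sym


def tokenize_line(text):
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == " ":
            i += 1
        elif ch.isalpha() or ch == "_":
            # Read the whole word first, then decide keyword vs. name.
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            word = text[i:j]
            if word in KEYWORDS:
                tokens.append(("KEYWORD", word))
            else:
                tokens.append(("NAME", word))
            i = j
        elif ch.isdigit():
            # Integers only; floats and other literals are left out.
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("NUMBER", text[i:j]))
            i = j
        else:
            kind, sym = longest_match(text, i)
            if kind is None:
                raise SyntaxError("unexpected character " + repr(ch))
            tokens.append((kind, sym))
            i += len(sym)
    return tokens


print(tokenize_line("if x **= 10:"))
```

With these tables, "iffy" comes out as a NAME rather than the keyword "if" followed by something else, because the lookup only happens once the whole word has been read.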
I have found this link to be very helpful. However, when I tried implementing it, my code quickly became ugly, with a lot of if statements and casework, and it didn’t seem like the ‘right’ way to do it.
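Indentation, at least, may not need much casework. The Python language reference describes it in terms of a stack of indentation widths, and the sketch below follows that idea under some loud assumptions: the input is a list of lines without trailing newlines, indentation uses spaces only, and there are no multi-line strings or bracketed continuation lines. The function name indent_tokens and the "LINE" token are placeholders I made up; a real lexer would tokenize the rest of each line instead.

```python
def indent_tokens(lines):
    # Stack of indentation widths currently open; 0 is always at the bottom.
    stack = [0]
    tokens = []
    for line in lines:
        stripped = line.lstrip(" ")
        if stripped == "" or stripped.startswith("#"):
            # Blank and comment-only lines do not affect indentation.
            continue
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than before: exactly one INDENT.
            stack.append(width)
            tokens.append(("INDENT", width))
        while width < stack[-1]:
            # Shallower: one DEDENT per level that is being closed.
            stack.pop()
            tokens.append(("DEDENT", stack[-1]))
        if width != stack[-1]:
            # The new width must match some enclosing level.
            raise IndentationError("dedent does not match any outer level")
        tokens.append(("LINE", stripped))
    # Close whatever is still open at end of input.
    while len(stack) > 1:
        stack.pop()
        tokens.append(("DEDENT", stack[-1]))
    return tokens


source = [
    "if x:",
    "    y = 1",
    "    if z:",
    "        w = 2",
    "v = 3",
]
for token in indent_tokens(source):
    print(token)
```

The number attached to each INDENT or DEDENT is the indentation width in effect after that token; it is only there to make the output easier to read.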
Are there any good resources out there that would help teach me to lex this kind of code? (I’d also like to fully parse it, but first things first, right?)
I am not above using parser generators, but I want the resulting Python code to use a simple subset of Python and to be reasonably self-contained, so that I can at least dream of having a language that can interpret itself. (For instance, from what I understand looking at this example, if I use ply, my language will also need to interpret the ply package in order to interpret itself, which I imagine would make things more complicated.)
Take a look at http://pyparsing.wikispaces.com/; you may find it useful for your task.