This an optimized version of the tokenizer that was first written, and it works fairly well. A secondary tokenizer can parse the output from this function to create classified tokens of greater specificity.
def tokenize(source):
return (token for token in (token.strip() for line
in source.replace('\r\n', '\n').replace('\r', '\n').split('\n')
for token in line.split('#', 1)[0].split(';')) if token)
My question is this: how can this be written simply with the re module? Below is my ineffective attempt.
def tokenize2(string):
search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
for match in search.finditer(string):
for item in match.groups():
yield item
Edit: This is the type of output that I am looking for from the tokenizer. Parsing the text should be easy.
>>> def tokenize(source):
return (token for token in (token.strip() for line
in source.replace('\r\n', '\n').replace('\r', '\n').split('\n')
for token in line.split('#', 1)[0].split(';')) if token)
>>> for token in tokenize('''\
a = 1 + 2; b = a - 3 # create zero in b
c = b * 4; d = 5 / c # trigger div error
e = (6 + 7) * 8
# try a boolean operation
f = 0 and 1 or 2
a; b; c; e; f'''):
print(repr(token))
'a = 1 + 2'
'b = a - 3 '
'c = b * 4'
'd = 5 / c '
'e = (6 + 7) * 8'
'f = 0 and 1 or 2'
'a'
'b'
'c'
'e'
'f'
>>>
I might be way off here-
If applicable, I would keep the
re.compileout of thedefscope.