This is an optimized version of the tokenizer that was first written, and it works

Asked: 2026-06-09T21:07:25+00:00

This is an optimized version of the tokenizer that was first written, and it works fairly well. A secondary tokenizer can parse the output of this function to create more specific, classified tokens.

def tokenize(source):
    return (token for token in (token.strip() for line
            in source.replace('\r\n', '\n').replace('\r', '\n').split('\n')
            for token in line.split('#', 1)[0].split(';')) if token)

My question is this: how can this be written simply with the re module? Below is my ineffective attempt.

import re

def tokenize2(string):
    search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
    for match in search.finditer(string):
        for item in match.groups():
            yield item

Edit: This is the type of output that I am looking for from the tokenizer. Parsing the text should be easy.

>>> def tokenize(source):
    return (token for token in (token.strip() for line
            in source.replace('\r\n', '\n').replace('\r', '\n').split('\n')
            for token in line.split('#', 1)[0].split(';')) if token)

>>> for token in tokenize('''\
a = 1 + 2; b = a - 3 # create zero in b
c = b * 4; d = 5 / c # trigger div error

e = (6 + 7) * 8
# try a boolean operation
f = 0 and 1 or 2
a; b; c; e; f'''):
    print(repr(token))


'a = 1 + 2'
'b = a - 3'
'c = b * 4'
'd = 5 / c'
'e = (6 + 7) * 8'
'f = 0 and 1 or 2'
'a'
'b'
'c'
'e'
'f'
>>>


1 Answer

Editorial Team · Answer 1 · 2026-06-09T21:07:27+00:00

I might be way off here, but:

>>> import re
>>> def tokenize(source):
...     search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
...     return (token.strip() for line in source.split('\n') if search.match(line)
...                   for token in line.split('#', 1)[0].split(';') if token.strip())
... 
>>> 
>>> 
>>> for token in tokenize('''\
... a = 1 + 2; b = a - 3 # create zero in b
... c = b * 4; d = 5 / c # trigger div error
... 
... e = (6 + 7) * 8
... # try a boolean operation
... f = 0 and 1 or 2
... a; b; c; e; f'''):
...     print(repr(token))
... 
'a = 1 + 2'
'b = a - 3'
'c = b * 4'
'd = 5 / c'
'e = (6 + 7) * 8'
'f = 0 and 1 or 2'
'a'
'b'
'c'
'e'
'f'
>>>
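
For what it's worth, I think the pattern from the question falls short on its own because it defines only two capture groups, and a repeated group in Python's re keeps just its last repetition. A minimal check, using one line from the sample input:

```python
import re

# Pattern copied unchanged from the question.
search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)

line = 'a; b; c; e; f'
match = search.match(line)

# The pattern has exactly two capture groups, so groups() can never
# return more than two values, however many ';' the line contains.
groups = match.groups()
print(len(groups), 'groups for', line.count(';') + 1, 'statements')
```

So match.groups() alone cannot recover every statement on a line, which is why the version above still leans on split.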

If possible, I would also move the re.compile call out of the function body, so the pattern is not set up again on every call.
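
If the goal is to let re do all of the splitting, here is a rough sketch of one way to go about it, not necessarily the only or best one. It assumes the same rules as the original tokenize: ';' separates statements, '#' starts a comment that runs to the end of the line, and '\r\n', '\r', and '\n' all end a line. The names TOKEN_RE and tokenize_re are made up for this example.

```python
import re

# Compiled once at module level. TOKEN_RE is a hypothetical name.
# First alternative: a comment, consumed up to the end of the line.
# Second alternative: a captured run of text containing no ';', '#',
# or line-break characters.
TOKEN_RE = re.compile(r'#[^\r\n]*|([^;#\r\n]+)')

def tokenize_re(source):
    """Yield stripped, non-empty statements, skipping '#' comments."""
    for match in TOKEN_RE.finditer(source):
        text = match.group(1)
        if text is None:
            # The comment alternative matched, so there is nothing to yield.
            continue
        token = text.strip()
        if token:
            yield token

sample = '''\
a = 1 + 2; b = a - 3 # create zero in b
c = b * 4; d = 5 / c # trigger div error

e = (6 + 7) * 8
# try a boolean operation
f = 0 and 1 or 2
a; b; c; e; f'''

tokens = list(tokenize_re(sample))
for token in tokens:
    print(repr(token))
```

On the sample input from the question this should give the same eleven tokens as the original tokenize. Putting the comment alternative first means a ';' inside a comment is swallowed with the rest of the comment instead of starting a new token.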
