Python’s tokenize returns all the found tokens’ position as two tuples of (startRow, startCol)

Question

Asked: June 2, 2026

Python’s tokenize returns the position of each token it finds as two tuples, (startRow, startCol) and (endRow, endCol).

Is there a way to return the positions as the offsets from the beginning of the string? That is, I would like to get rid of (row, col) in favor of just “offset”.

1 Answer

Editorial Team · Answer 1 · 2026-06-02T20:48:10+00:00

There isn’t a built-in way to do this in tokenize.

If you had access to the same set of lines being used by the tokenizer, you could run through and store the accumulated “total length of lines before line X” into a list, and then use that to convert the row values into additive offsets.

For instance:

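import tokenize

def tokens_with_offset(path):
    # line_offsets[i] is the offset of the first character of line i + 1
    # (tokenize reports rows starting at 1, columns starting at 0).
    line_offsets = []
    line_offset_accum = 0
    with open(path) as f:
        for line in f:
            line_offsets.append(line_offset_accum)
            line_offset_accum += len(line)
    # Extra entry: ENDMARKER (and any trailing DEDENT) tokens are reported
    # on the row after the last line.
    line_offsets.append(line_offset_accum)

    with open(path) as f:
        for ttype, tstring, tbegin, tend, tline in tokenize.generate_tokens(f.readline):
            # Subtract 1 to turn the 1-based row into a list index.
            offset_begin = line_offsets[tbegin[0] - 1] + tbegin[1]
            offset_end = line_offsets[tend[0] - 1] + tend[1]
            yield ttype, tstring, offset_begin, offset_end, tline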

(Note: haven’t tested this code, it’s more as an example of the general concept.)
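Since the question asks about offsets from the beginning of a string, the same idea can be sketched for source held in a str rather than a file. This is only a minimal sketch: token_offsets is a hypothetical helper name, not part of tokenize, and it assumes the offsets you want are character offsets into that exact string. Both passes read lines through io.StringIO so that the line lengths match the lines tokenize sees.

```python
import io
import tokenize


def token_offsets(source):
    """Yield (type, string, start_offset, end_offset) for each token.

    Hypothetical helper, not part of the tokenize module. Offsets are
    character offsets into `source`.
    """
    # line_starts[i] is the offset of the first character of row i + 1
    # (tokenize rows are 1-based). The final entry covers the row after
    # the last line, where ENDMARKER and trailing DEDENT tokens land.
    line_starts = [0]
    for line in io.StringIO(source):
        line_starts.append(line_starts[-1] + len(line))

    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        start = line_starts[tok.start[0] - 1] + tok.start[1]
        end = line_starts[tok.end[0] - 1] + tok.end[1]
        yield tok.type, tok.string, start, end


if __name__ == "__main__":
    src = 'x = 1\ns = """a\nb"""\n'
    for ttype, text, start, end in token_offsets(src):
        print(tokenize.tok_name[ttype], repr(text), start, end)
```

When the source ends with a newline, source[start:end] should give back each token's text, including tokens that span several lines. If the source does not end with a newline, tokenize adds an implicit NEWLINE token, so that token's end offset can point one past the end of the string.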
