I need to split a string into words, but also get the starting and

Question

Asked: May 31, 2026 (2026-05-31T00:35:56+00:00)

I need to split a string into words, but also get the starting and ending offset of the words. So, for example, if the input string is:

input_string = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"

I want to get:

[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
 ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]

I’ve got some working code that does this using input_string.split and calls to .index, but it’s slow. I tried to code it by manually iterating through the string, but that was slower still. Does anyone have a fast algorithm for this?

Here are my two versions:

def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))

    return offsets

def manual_iteration(line):
    start = 0
    offsets = []
    word = ''
    for off, char in enumerate(line + ' '):
        if char in ' \t\r\n':
            if off > start:
                offsets.append((word, start, off - 1))
            start = off + 1
            word = ''
        else:
            word += char

    return offsets

Timed with timeit, “using_split” is the fastest, followed by “manual_iteration”; the slowest so far is the re.finditer approach that was suggested in an answer.
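The re.finditer variant mentioned above is not shown in this thread. A minimal sketch of what it might look like follows; the name using_finditer and the \S+ pattern are my assumptions, not the code that was actually timed. It treats any run of non-whitespace characters as a word, which should match what line.split() produces for ordinary text.

```python
import re

# Assumed pattern: a "word" is any maximal run of non-whitespace characters.
_WORD = re.compile(r'\S+')

def using_finditer(line):
    # m.end() is exclusive, so subtract 1 to get the inclusive ending offset
    # used in the question's expected output.
    return [(m.group(), m.start(), m.end() - 1) for m in _WORD.finditer(line)]
```

Called on the input_string from the question, this should give the same list of (word, start, end) tuples as using_split.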

1 Answer

Editorial Team · Answer 1 · 2026-05-31T00:35:58+00:00

The following runs faster, saving about 30%. All I did was bind the functions and methods used inside the loop to local names in advance, so they are not looked up again on every iteration:

def using_split2(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append((word, word_offset, running_offset - 1))
    return offsets
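To check the speedup on your own machine, something like the following timeit sketch could be used. The compare helper and its number/repeat defaults are hypothetical, and the two functions are restated only so the snippet runs on its own. The actual gain will depend on the Python version and the input.

```python
import timeit

def using_split(line):
    # Original version from the question, restated compactly.
    offsets = []
    running_offset = 0
    for word in line.split():
        word_offset = line.index(word, running_offset)
        running_offset = word_offset + len(word)
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

def using_split2(line, _len=len):
    # Version from this answer: lookups are bound to local names up front.
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        running_offset = word_offset + _len(word)
        append((word, word_offset, running_offset - 1))
    return offsets

def compare(line, number=20000, repeat=5):
    # Hypothetical helper: best-of-`repeat` total time in seconds per function.
    results = {}
    for func in (using_split, using_split2):
        timings = timeit.repeat(lambda: func(line), number=number, repeat=repeat)
        results[func.__name__] = min(timings)
    return results

if __name__ == "__main__":
    sample = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
    for name, seconds in compare(sample).items():
        print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the difference being measured is this small.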
