I need to split a string into words, but also get the starting and ending offset of the words. So, for example, if the input string is:
input_string = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
I want to get:
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
I’ve got some working code that does this using input_string.split and calls to .index, but it’s slow. I tried to code it by manually iterating through the string, but that was slower still. Does anyone have a fast algorithm for this?
Here are my two versions:
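def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        # Search from the end of the previous word so repeated words are not matched twice.
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        # The end offset is inclusive, hence the - 1.
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

def manual_iteration(line):
    start = 0
    offsets = []
    word = ''
    # The appended space acts as a sentinel that flushes the last word.
    for off, char in enumerate(line + ' '):
        if char in ' \t\r\n':
            if off > start:
                offsets.append((word, start, off - 1))
            # Always advance start, so runs of whitespace are skipped.
            start = off + 1
            word = ''
        else:
            word += char
    return offsets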
According to timeit, “using_split” is the fastest, followed by “manual_iteration”; the slowest so far is re.finditer, as suggested below.
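For reference, here is a minimal sketch of what the re.finditer variant might look like, along with one way to time it. The function name using_finditer and the pattern \S+ are my own assumptions (the pattern should pick out the same words as str.split() for ordinary whitespace), and the timings will vary with machine and Python version.

```python
import re
import timeit

# Assumption: a "word" is any run of non-whitespace characters,
# which mirrors what str.split() returns for ordinary input.
_WORD = re.compile(r'\S+')

def using_finditer(line):
    # Hypothetical name. m.end() is exclusive, so subtract 1
    # to get the inclusive end offset used in the question.
    return [(m.group(), m.start(), m.end() - 1) for m in _WORD.finditer(line)]

if __name__ == '__main__':
    sample = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
    print(using_finditer(sample))
    # Total seconds for 100,000 calls; compare against the other versions the same way.
    print(timeit.timeit(lambda: using_finditer(sample), number=100000))
```

Precompiling the pattern keeps the regex compilation cost out of the timed call, so the comparison with the split-based version is a bit fairer.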
The following runs slightly faster – it saves about 30%. All I did was define the functions in advance: