I have a string that is the concatenation of three components:
- one word from list 1 (includes an empty string)
- one word from list 2
- one word from list 3 (includes an empty string)
Lists 1, 2 and 3 can each have up to 5000 elements. Elements in one list are not in the others (except for the empty string). However, some words can be part of other words.
I am looking for an efficient way to find the three components. Right now I am doing the following:
for word in list2:
    if word in long_word:
        try:
            [bef, aft] = long_word.split(word)
        except ValueError:  # too many values to unpack
            continue
        if bef in list1 and aft in list3:
            print('Found: {}, {}, {}'.format(bef, word, aft))
            break
else:
    print('Not found')
I wonder if there is a better way. I thought about using the pipe (alternation) in a regex, but the number of alternatives seems to be too large, as I get: OverflowError: regular expression code size limit exceeded.
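For illustration, one alternative that avoids both the regex and the scan over list 2 is to try every pair of split points and test each piece against a set. This is only a sketch under assumptions: the name `split_three` is hypothetical, and it assumes the lists can be converted to sets (ideally once, outside the function, if many words are processed). The work then depends on the length of long_word rather than on the size of the lists.

```python
def split_three(long_word, list1, list2, list3):
    """Yield every (w1, w2, w3) such that w1 + w2 + w3 == long_word."""
    # Sets give average O(1) membership tests, unlike lists.
    # In real use, build these once and reuse them across calls.
    set1, set2, set3 = set(list1), set(list2), set(list3)
    n = len(long_word)
    for i in range(n + 1):          # i marks the end of the first component
        prefix = long_word[:i]
        if prefix not in set1:
            continue
        for j in range(i, n + 1):   # j marks the end of the second component
            middle, suffix = long_word[i:j], long_word[j:]
            if middle in set2 and suffix in set3:
                yield prefix, middle, suffix


if __name__ == '__main__':
    list1 = ['', 'un', 'under']
    list2 = ['stand', 'do', 'able']
    list3 = ['', 'ing', 'able']
    for parts in split_three('understanding', list1, list2, list3):
        print('Found: {}, {}, {}'.format(*parts))
```

Unlike the split-based loop, this also handles the case where the middle word occurs more than once in long_word, because it never relies on `str.split` returning exactly two pieces.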
Thanks,
Update
I tried a modified version of the proposed solutions:
def fj(long_word, list1, list2, list3):
    for x in filter(long_word.startswith, list1):
        for y in filter(long_word[len(x):].startswith, list2):
            z = long_word[len(x)+len(y):]
            if z in list3:
                yield x, y, z

def sid(long_word, list1, list2, list3):
    for w1 in list1:
        if not long_word.startswith(w1):
            continue
        cut1 = long_word[len(w1):]
        for w2 in list2:
            if not cut1.startswith(w2):
                continue
            cut2 = cut1[len(w2):]
            for w3 in list3:
                if cut2 == w3:
                    yield w1, w2, w3

def my(long_word, list1, list2, list3):
    for word in list2:
        if word in long_word:
            try:
                [bef, aft] = long_word.split(word)
            except ValueError:  # too many values to unpack
                continue
            if bef in list1 and aft in list3:
                yield bef, word, aft
These are the (normalized) timings I get using lists with 8000 elements and 10000 repetitions, each time picking one word at random from each list to generate long_word:
- my: 1.0
- sid: 4.5
- fj: 2.7
I am really surprised, as I thought that fj’s method was going to be the fastest.
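For anyone who wants to reproduce a comparison like this, a timing harness could look roughly like the sketch below. The helper names `make_case` and `normalized_timings` are hypothetical, and two assumptions are built in that may not match the original measurement: every function is run on the same random inputs, and each generator is fully exhausted rather than stopped at the first match.

```python
import random
import timeit


def make_case(list1, list2, list3, rng):
    """Pick one word per list; return (long_word, (w1, w2, w3))."""
    parts = (rng.choice(list1), rng.choice(list2), rng.choice(list3))
    return ''.join(parts), parts


def normalized_timings(funcs, list1, list2, list3, repeats=1000, seed=0):
    """Time each function on identical inputs; the fastest gets 1.0."""
    rng = random.Random(seed)
    words = [make_case(list1, list2, list3, rng)[0] for _ in range(repeats)]
    totals = {}
    for func in funcs:
        # list(...) exhausts the generator so all of its work is measured.
        totals[func.__name__] = timeit.timeit(
            lambda: [list(func(w, list1, list2, list3)) for w in words],
            number=1,
        )
    best = min(totals.values())
    return {name: total / best for name, total in totals.items()}
```

Calling `normalized_timings([my, sid, fj], list1, list2, list3)` would return a dict of relative timings. Whether the generators are exhausted or stopped at the first hit can change the ranking, so it is worth stating which was measured.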
Regular expressions probably aren’t a great fit here. I would go about it like this: