I have a list of strings from which I need to remove all elements that match a substring from another list. I am trying to do this with lists, nested loops, and regex.
The following snippet produces ["We don't", "need no", "education"] instead of the desired ["education"]. I'm new to Python and this is my first experiment with regex, and I'm stuck on the syntax.
import re
testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
dellist = []
for x in range(len(testfile)):
    for y in range(len(stopwords)):
        if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):
            dellist.append(testfile[x])
for x in range(len(dellist)):
    if dellist[x] in testfile:
        del testfile[testfile.index(dellist[x])]
print testfile
The line
if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):
returns “None” for all iterations through the loop, so I’m guessing this is where my problem lies…
It's because re.match only tests for a match at the start of the string. Try re.search instead. Also, you're missing the r on your second '\b': without it, Python treats '\b' as a backspace character rather than a regex word boundary, so the condition should read re.search(r'\b' + stopwords[y] + r'\b', testfile[x], re.I).
Also, you could just use a list comprehension to build up dellist (you could probably use a list comprehension to build the new testfile entirely, but it escapes me at the moment).
Another thought: since you're using the re module anyway, why not combine your stopwords into a single pattern, \b(We|no)\b, and test each element of testfile against that one regex? Then you just have to keep the elements that don't match it.
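The suggestions in this answer could be sketched roughly as follows. This is only one way to write it, not necessarily what the answer originally showed. The names pattern and filtered are made up for illustration, and the re.escape calls are an added precaution (an assumption, not part of the question) in case a stopword ever contains regex metacharacters.

```python
import re

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# Per-stopword version: re.search instead of re.match, and a raw
# string on BOTH \b so each one is a word boundary, not a backspace.
dellist = [line for line in testfile
           if any(re.search(r'\b' + re.escape(word) + r'\b', line, re.I)
                  for word in stopwords)]

# Combined version: one alternation such as \b(We|no)\b, compiled once.
# "pattern" and "filtered" are illustrative names.
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(word) for word in stopwords) + r')\b',
    re.I)

# Keep only the elements that do NOT contain a stopword.
filtered = [line for line in testfile if not pattern.search(line)]

print(filtered)
```

Building a new list avoids deleting from testfile while working through it, and the combined pattern means each element is scanned once rather than once per stopword.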