I have to match all the alphanumeric words from a text.
>>> import re
>>> text = "hello world!! how are you?"
>>> final_list = re.findall(r"[a-zA-Z0-9]+", text)
>>> final_list
['hello', 'world', 'how', 'are', 'you']
>>>
This is fine, but further I have few words to negate i.e. the words that shouldn’t be in my final list.
>>> negate_words = ['world', 'other', 'words']
A bad way to do it
>>> negate_str = '|'.join(negate_words)
>>> filter(lambda x: not re.match(negate_str, x), final_list)
['hello', 'how', 'are', 'you']
But i can save a loop if my very first regex-pattern can be changed to consider negation of those words. I found negation of characters but i have words to negate, also i found regex-lookbehind in other questions, but that doesn’t help either.
Can it be done using python re?
Update
My text can span a few hundered lines. Also, list of negate_words can be lengthy too.
Considering this, is using regex for such task, correct in the first place?? Any suggestions??
I don’t think there is a clean way to do this using regular expressions. The closest I could find was bit ugly and not exactly what you wanted:
Why not use Python’s sets instead. They are very fast:
If order is important, see the reply from @glglgl below. His list comprehension version is very readable. Here’s a fast but less readable equivalent using itertools:
Another alternative is the build-up the word list in a single pass using re.finditer: