I have this piece of code that finds words that begin with @ or #,
p = re.findall(r'@\w+|#\w+', str)
Now what irks me about this is repeating \w+. I am sure there is a way to do something like
p = re.findall(r'(@|#)\w+', str)
I expected that to produce the same result, but it doesn’t: it returns only # and @. How can that regex be changed so that I am not repeating \w+? This code comes close,
p = re.findall(r'((@|#)\w+)', str)
But it returns [('@many', '@'), ('@this', '@'), ('#tweet', '#')] (notice the extra '@', '@', and '#').
Also, if I’m repeating this re.findall call 500,000 times, can the regex be compiled into a pattern to make it faster?
The solution
You have two options:
(?:@|#)\w+ (a non-capturing group)
[@#]\w+ (a character class)
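As a minimal sketch of the two options, and of the compilation the question asks about, the snippet below compiles each pattern once and reuses it. The sample text is hypothetical, chosen only so that it contains the three tokens shown in the question. Whether precompiling gives a noticeable speedup is worth measuring, since the re module already caches recently used patterns.

```python
import re

# Hypothetical sample input containing the tokens from the question.
text = "so @many people liked @this #tweet"

# Option 1: non-capturing group. Option 2: character class.
# Compiling once lets the same pattern object be reused across many calls.
non_capturing = re.compile(r'(?:@|#)\w+')
char_class = re.compile(r'[@#]\w+')

print(non_capturing.findall(text))  # ['@many', '@this', '#tweet']
print(char_class.findall(text))     # ['@many', '@this', '#tweet']
```

Neither pattern contains a capturing group, so findall returns the whole matches.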
Understanding findall
The problem you were having is due to how findall returns matches depending on how many capturing groups are present.
Let’s take a closer look at the pattern ((@|#)\w+), which contains two groups: the outer parentheses form group 1, and the inner (@|#) forms group 2.
Capturing groups allow us to save what the subpatterns matched within the overall pattern.
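To make the group numbering concrete, here is a small sketch that inspects the groups of ((@|#)\w+) on a single match. The input string is a made-up example, not taken from the question.

```python
import re

pattern = re.compile(r'((@|#)\w+)')

# Number of capturing groups in the pattern.
print(pattern.groups)  # 2

# Hypothetical input with a single hashtag.
m = pattern.search("a #tweet")
print(m.group(0))  # '#tweet' (the whole match)
print(m.group(1))  # '#tweet' (outer group)
print(m.group(2))  # '#'      (inner group)
```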
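Now let’s take a look at the Python documentation for the re module. It states that if one or more groups are present in the pattern, findall returns a list of groups, and that this will be a list of tuples if the pattern has more than one group.
This explains why ((@|#)\w+) gives you [('@many', '@'), ('@this', '@'), ('#tweet', '#')].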
As specified, since the pattern has more than one group, findall returns a list of tuples, one for each match. Each tuple gives you what was captured by the groups for the given match.
The documentation also explains why (@|#)\w+ gives you only the @ and # characters.
Now the pattern has only one group, and findall returns a list of matches for that group.
In contrast, the patterns given above as solutions don’t have any capturing groups, which is why they work according to your expectation and return the whole matches, such as '@many' and '#tweet'.
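The three cases can be compared side by side. This is only a sketch: the sample text is hypothetical, and the finditer variant at the end is an alternative not mentioned above, shown because group(0) always holds the whole match regardless of how many groups the pattern has.

```python
import re

# Hypothetical sample input containing the tokens from the question.
text = "so @many people liked @this #tweet"

# Map each pattern to (number of capturing groups, findall result).
results = {}
for p in [r'((@|#)\w+)', r'(@|#)\w+', r'(?:@|#)\w+', r'[@#]\w+']:
    results[p] = (re.compile(p).groups, re.findall(p, text))
    print(p, results[p])

# Alternative: keep the capturing group but read the whole match explicitly.
full_matches = [m.group(0) for m in re.finditer(r'(@|#)\w+', text)]
print(full_matches)  # ['@many', '@this', '#tweet']
```

With two groups the result is a list of tuples, with one group it is a list of that group's captures, and with no groups it is a list of whole matches.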
References
Python documentation: re module