I’m using Python 2.7 and BeautifulSoup.
I need to find an acronym such as abc or a.b.c. while avoiding false positives like qweabcrty. The pattern can be at the beginning or the end of the string, and it can have a space, quote, double quote, hyphen (and so on) right before or after it, but not an alphanumeric character.
I came up with this regex:
[^\w]?a\.?b\.?c\.?[^\w]?
That works for:
- abc
- a.b.c.
- blah (abc)
- abc-blah
- blah-abc
- blah abc blah
- blah-abc-blah
But it also matches this, which I don’t want:
- qweabcrty
If I remove the ? after both [^\w], it no longer finds cases 1, 2, 4 and 5, because it then requires a character before and/or after the acronym.
Long story short, how can I specify this:
abc can be anywhere in the string, BUT IF there is a character before and/or after it, that character must not be alphanumeric.
The Python code looks like:
import re
from bs4 import BeautifulSoup, SoupStrainer
html = """
<html>
<a>abc</a>
<a>a.b.c.</a>
<a>blah (abc)</a>
<a>abc-blah</a>
<a>blah-abc</a>
<a>blah abc blah</a>
<a>blah-abc-blah</a>
<a>qweabcrty</a>
</html>"""
links = BeautifulSoup(html, "lxml", parse_only=SoupStrainer(["a"]))
tags = links.find_all("a", text = re.compile("[^\w]?a\.?b\.?c\.?[^\w]?", re.I))
print tags
Try using the word boundary (\b) metacharacter.
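As a minimal sketch of that idea, the snippet below tests the sample strings from the question directly with `re`, without BeautifulSoup. The names `samples`, `word_boundary` and `lookaround` are made up for illustration, and the exact patterns are one possible way to write this, not necessarily the only one. The second pattern is an alternative that spells out the "no alphanumeric neighbour" rule with negative lookarounds. Note that `\w` also counts the underscore as a word character.

```python
import re

# Sample strings taken from the <a> tags in the question.
samples = [
    "abc", "a.b.c.", "blah (abc)", "abc-blah", "blah-abc",
    "blah abc blah", "blah-abc-blah", "qweabcrty",
]

# \b matches only between a word character and a non-word character
# (or the start/end of the string), so "abc" inside "qweabcrty" is rejected.
word_boundary = re.compile(r"\ba\.?b\.?c\b", re.I)

# Alternative sketch: negative lookbehind/lookahead state the rule literally,
# "not preceded by a word character and not followed by one".
lookaround = re.compile(r"(?<!\w)a\.?b\.?c\.?(?!\w)", re.I)

matched = [s for s in samples if word_boundary.search(s)]
matched_lookaround = [s for s in samples if lookaround.search(s)]

print(matched)
print(matched_lookaround)
```

Both lists should contain every sample except qweabcrty. Either compiled pattern can be passed as the `text` argument of the `find_all` call from the question in place of the original regex.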