I am trying to create a regular expression in Python that matches #hashtags. My definition of a hashtag is:
- It is a word that starts with a #
- It can contain all characters except [ ,\.]
- It can be anywhere in the text
So in this text
#This string cont#ains #four, and #only four #hashtags.
The hashtags here are This, four, only and hashtags.
The problem I have is the optional check for the beginning of the line.
[ \.,]+ won’t do it since it requires at least one separator and so never matches at the beginning of the text. [ \.,]? won’t do it since it matches too much: it also accepts a # in the middle of a word.
Example with +
In []: re.findall(r'[ \.,]+#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['four', 'only', 'hashtags']
Example with ?
In []: re.findall(r'[ \.,]?#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['This', 'ains', 'four', 'only', 'hashtags']
How can I optionally match the beginning of the line?
This seems to work:
\B: Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so is also subject to the settings of LOCALE and UNICODE.
\W: When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.
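To see how \B can replace the optional separator check, here is a minimal sketch. The pattern \B#([^ .,]+) is my own reading of the \B description above, not necessarily the exact expression the answer used. The second pattern, with a negative lookbehind, is an additional assumption of mine that sticks strictly to the separators named in the question. The names TEXT, BOUNDARY and SEPARATOR are illustrative only.

```python
import re

TEXT = "#This string cont#ains #four, and #only four #hashtags."

# \B succeeds between two non-word characters, and at the start of the
# string when the first character is a non-word character such as '#'.
# It fails between 't' and '#' in "cont#ains", which is a word boundary.
BOUNDARY = re.compile(r"\B#([^ .,]+)")

# Stricter variant (an assumption, not from the thread): the negative
# lookbehind rejects a '#' preceded by anything other than ' ', '.' or ','.
# At the start of the text there is no preceding character, so it passes.
SEPARATOR = re.compile(r"(?<![^ .,])#([^ .,]+)")

print(BOUNDARY.findall(TEXT))   # ['This', 'four', 'only', 'hashtags']
print(SEPARATOR.findall(TEXT))  # ['This', 'four', 'only', 'hashtags']

# The two patterns differ when '#' follows other punctuation:
print(BOUNDARY.findall("see (#tag) now"))   # ['tag)']
print(SEPARATOR.findall("see (#tag) now"))  # []
```

Both patterns give the four expected hashtags on the sample text. Which one fits better depends on whether a # after punctuation such as ( should count as the start of a hashtag; the definition in the question suggests it should not, which would favor the lookbehind form.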