I am trying to extract hashtags for a simple college project using Ruby on Rails. I am having trouble with tags that consist only of digits and with tags that are not preceded by a space.
text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
The regex I have is /(?:^|\s)#(\w+)/i (source)
This regex returns ["box", "5", "2good", "first"].
How can I make sure it returns only ["box", "2good"] and ignores the rest, since they are not 'real' hashtags?
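For reference, the behaviour described above can be reproduced with a short script. This is only a sketch: the method name `extract_with_original_regex` and the constant name are made up for illustration, and it assumes the tags are collected with `String#scan`.

```ruby
# The regex from the question: a '#' at the start of a line or after
# whitespace, followed by one or more word characters (captured).
ORIGINAL_HASHTAG_REGEX = /(?:^|\s)#(\w+)/i

# Hypothetical helper: returns the captured tag names as a flat array.
def extract_with_original_regex(text)
  # scan returns one single-element array per match when the pattern
  # has a capture group, so flatten it into a plain list of strings.
  text.scan(ORIGINAL_HASHTAG_REGEX).flatten
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p extract_with_original_regex(text)
```

The all-digit tag "5" and the first half of "#first#second" both get through, which is the problem to solve.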
Can you try this regex:
UPDATE 1:
There are a few cases that the above regex will not match, such as #blah23blah and #23blah23.
Hence I modified the regex to take care of all cases.
Regex:
Breakdown:
(?:\s|^) - Matches the preceding space or start of line. Does not capture the match.
# - Matches the hash but does not capture it.
(?!\d+(?:\s|$)) - Negative lookahead to reject tags made up entirely of digits between the # and the next space (or end of line).
(\w+) - Matches and captures all word characters.
(?=\s|$) - Positive lookahead to ensure a following space or end of line. This is required so that adjacent valid hashtags are matched.
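Putting the pieces of the breakdown together gives a pattern along the following lines. This is a sketch assembled from the parts listed above, so the exact grouping in the original regex may differ. The constant and method names are hypothetical.

```ruby
# Assembled from the breakdown:
#   (?:\s|^)          preceding whitespace or start of line (not captured)
#   #                 the literal hash
#   (?!\d+(?:\s|$))   reject tags that are digits only
#   (\w+)             capture the tag itself
#   (?=\s|$)          tag must be followed by whitespace or end of line
HASHTAG_REGEX = /(?:\s|^)#(?!\d+(?:\s|$))(\w+)(?=\s|$)/i

# Hypothetical helper: returns the captured tag names as a flat array.
def extract_hashtags(text)
  text.scan(HASHTAG_REGEX).flatten
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p extract_hashtags(text)                             # ["box", "2good"]
p extract_hashtags("#blah23blah #23blah23 #123")     # ["blah23blah", "23blah23"]
```

Because the trailing `(?=\s|$)` is a lookahead, the space after one tag is not consumed and is still available as the leading `(?:\s|^)` of the next, so `"#a #b"` yields both tags.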
Sample text modified to capture most cases:
Matches:
Match 1: blah
Match 2: box
Match 3: good2
Match 4: 3good
Match 5: mkvef214asdwq
Match 6: 3e4
Match 7: 2good
Rubular link
UPDATE 2:
To exclude words starting or ending with underscore, just include your exclusions in the negative lookahead like this:
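The regex for this update is not shown here, so the following is only one possible way to do it. It assumes the two extra exclusions are added as alternatives inside the same negative lookahead. The constant and method names are hypothetical.

```ruby
# Negative lookahead now rejects three kinds of tag:
#   \d+(?:\s|$)     digits only
#   _               starts with an underscore
#   \w*_(?:\s|$)    ends with an underscore
HASHTAG_REGEX_NO_EDGE_UNDERSCORE =
  /(?:\s|^)#(?!\d+(?:\s|$)|_|\w*_(?:\s|$))(\w+)(?=\s|$)/i

# Hypothetical helper: returns the captured tag names as a flat array.
def extract_hashtags_no_edge_underscore(text)
  text.scan(HASHTAG_REGEX_NO_EDGE_UNDERSCORE).flatten
end

p extract_hashtags_no_edge_underscore("#_foo #bar_ #ok_tag #123 #a1")
# ["ok_tag", "a1"]
```

Underscores inside a tag, as in `#ok_tag`, are still allowed. Only a leading or trailing underscore disqualifies the tag.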
The sample, regex and matches are recorded in this Rubular link