Given the input:
str = "foo bar jim jam. jar jee joon."
I need to output all 2- and 3-word phrases whose words are separated only by spaces:
[ "foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
"foo bar jim", "bar jim jam", "jar jee joon" ]
Note in particular the lack of “jam jar”, “jim jam jar” and “jam jar jee” in the above, due to the period.
I can’t use str.scan(/\w+/).each_cons(2).map{ |a| a.join(' ') } because that would include "jam jar".
Scanning for /\w+ \w+/ yields ["foo bar", "jim jam", "jar jee"], which is missing “bar jim” and “jee joon” because scan never returns overlapping matches. That is the heart of the problem.
The real-world application for this is generating a phrase-based index for a search engine. I want to find all the truly-consecutive words as phrases, excluding those with punctuation separating words.
Edit: Seems like there might be a way to do this in regex/scan via a variation on:
"a b c d".scan(/(?=([abc] [abc]) )[abc]/)
#=> [["a b"], ["b c"]]
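Adapting that lookahead idea to whole words might look like the sketch below. It assumes a "word" is `\w+` and that consecutive words are separated by exactly one space; the variable names `pairs`, `triples` and `phrases` are only illustrative. The lookahead captures the full phrase without consuming it, so the next match can begin at the very next word.

```ruby
str = "foo bar jim jam. jar jee joon."

# The lookahead captures the whole phrase; only the first word is consumed,
# which lets the following match start at the next word (overlapping phrases).
# The leading \b keeps a match from starting in the middle of a word.
pairs   = str.scan(/\b(?=(\w+ \w+))\w+/).flatten
triples = str.scan(/\b(?=(\w+ \w+ \w+))\w+/).flatten

phrases = pairs + triples
#=> ["foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
#    "foo bar jim", "bar jim jam", "jar jee joon"]
```

Because only a single space is allowed between words, any punctuation (not just a period) breaks a phrase. Apostrophes are not part of `\w`, so a word like "don't" would be split; that may or may not suit an index.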
I believe this does the job, although it assumes that periods are the only punctuation:
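A minimal sketch of one such approach, assuming periods are the only delimiters between runs of words: split the string on periods, then take sliding windows of words within each piece so that no phrase crosses a period. The names `sentences`, `pairs` and `triples` are illustrative.

```ruby
str = "foo bar jim jam. jar jee joon."

# Break the text at periods, then extract the words of each piece.
sentences = str.split('.').map { |s| s.scan(/\w+/) }

# Sliding windows of 2 and 3 words, taken separately per sentence.
pairs   = sentences.flat_map { |words| words.each_cons(2).map { |a| a.join(' ') } }
triples = sentences.flat_map { |words| words.each_cons(3).map { |a| a.join(' ') } }

phrases = pairs + triples
#=> ["foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
#    "foo bar jim", "bar jim jam", "jar jee joon"]
```

Other punctuation inside a sentence, such as a comma, would be skipped by `scan(/\w+/)` and the words on either side would still be joined, which is the limitation noted above.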
Edit: or, with a little less repetition:
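As a sketch under the same periods-only assumption, the repetition can be reduced by looping over the desired phrase lengths instead of writing one line per length. The list `[2, 3]` and the variable names are illustrative.

```ruby
str = "foo bar jim jam. jar jee joon."

# Words grouped per sentence, splitting only on periods.
sentences = str.split('.').map { |s| s.scan(/\w+/) }

# For each phrase length n, collect every n-word window in every sentence.
phrases = [2, 3].flat_map do |n|
  sentences.flat_map { |words| words.each_cons(n).map { |a| a.join(' ') } }
end
#=> ["foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
#    "foo bar jim", "bar jim jam", "jar jee joon"]
```

Supporting longer phrases is then just a matter of adding another length to the list.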