I am trying to write a regular expression that counts the number of times two words co-occur within a certain proximity (within 5 words of each other) in a string, without double-counting any word.
For example, if I had a string:
“The man liked his big hat. The hat was very big.”
In this case, the regex should see the “big hat” in the first sentence and the “hat was very big” in the second sentence, returning a total of 2. Note that in the second sentence there are several words between “hat” and “big”, and they appear in a different order than in the first sentence, but they still occur within a 5-word window.
If regular expressions are not the correct way to approach this problem, please let me know what I should try instead.
This approach is a bit like Stephen C's, but uses library classes to assist with the mechanics.
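As a rough illustration only (not necessarily the code this answer had in mind), here is a minimal Java sketch of the idea: let `Pattern`/`Matcher` do the tokenizing, record the word index of every occurrence of each target word, then pair the two position lists greedily so that no occurrence is counted twice. The class name `CoOccurrenceCounter`, the method `count`, the definition of a "word" (a run of letters, digits, or apostrophes), the case-insensitive matching, and the reading of "within 5 words" as "word indexes differ by at most 5" are all assumptions made for this sketch. It also assumes the two target words are different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CoOccurrenceCounter {

    // Assumption: a "word" is a run of letters, digits, or apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}']+");

    // Counts disjoint (first, second) pairs whose word indexes differ by at
    // most maxDistance. Order of the two words in the text does not matter.
    static int count(String text, String first, String second, int maxDistance) {
        String a = first.toLowerCase(Locale.ROOT);
        String b = second.toLowerCase(Locale.ROOT);
        List<Integer> aPositions = new ArrayList<>();
        List<Integer> bPositions = new ArrayList<>();

        // Tokenize with the regex and remember where each target word occurs.
        Matcher m = WORD.matcher(text.toLowerCase(Locale.ROOT));
        int index = 0;
        while (m.find()) {
            String token = m.group();
            if (token.equals(a)) {
                aPositions.add(index);
            } else if (token.equals(b)) {
                bPositions.add(index);
            }
            index++;
        }

        // Greedy two-pointer pairing over the two sorted position lists.
        // Each occurrence is consumed at most once, so nothing is double counted.
        int i = 0, j = 0, pairs = 0;
        while (i < aPositions.size() && j < bPositions.size()) {
            int pa = aPositions.get(i);
            int pb = bPositions.get(j);
            if (Math.abs(pa - pb) <= maxDistance) {
                pairs++;
                i++;
                j++;
            } else if (pa < pb) {
                i++; // this occurrence is too far behind to pair with anything later
            } else {
                j++;
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String text = "The man liked his big hat. The hat was very big.";
        System.out.println(count(text, "big", "hat", 5)); // prints 2
    }
}
```

In the example string, "big" sits at word indexes 4 and 10 and "hat" at 5 and 7, so the pairing is (4, 5) and (10, 7), giving 2. The regex is used only for splitting the text into words; the counting itself is plain list processing, which tends to be easier to get right than a single expression with lookarounds.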