Given a specific word (say, “balloon”), I would like to find the n words immediately before and after it in the Title column of my table, group the results by those phrases, and count how often each one occurs.
For example, if the data set was:
- red balloon sky
- yellow balloon sky road
- blue balloon chair
I’d like the results to be something like:
- red balloon | 1
- yellow balloon | 1
- blue balloon | 1
- balloon sky | 2
- balloon chair | 1
I figured the best way to accomplish this would be with regex in my stored procedure, so I added the great regex functions listed here, along with the FindWordsInContext function.
To start with:
WITH Words_CTE (Title) AS
(
    -- Define the CTE query.
    SELECT Title
    FROM ItemData
    WHERE Title LIKE '%balloon%'
)
-- Define the outer query referencing the CTE name.
SELECT Title
FROM Words_CTE;
So I figured I would start with that, work the FindWordsInContext function into the mix, and then group on the words before and after the given word.
— UPDATE —
Thanks to Adrian Iftode below… but the code doesn’t exactly do what I’m looking for.
DECLARE @table TABLE (Sentence varchar(250));

INSERT INTO @table (Sentence)
VALUES ('I have another red balloon in the car.'),
       ('Here is a new balloon for you.'),
       ('A red balloon is in the other room.'),
       ('Is there another balloon for me?');

SELECT TOP (5) SentencePart, NumberOfWords
FROM @table
CROSS APPLY dbo.fnGetPartsFromSentence(Sentence, 'balloon') f
ORDER BY
    NumberOfWords DESC,
    CASE WHEN f.Side = 'R' THEN 0 ELSE 1 END;
Outputs:
SentencePart | NumberOfWords
balloon is in the other room. | 5
I have another red balloon | 4
Here is a new balloon | 4
Is there another balloon | 3
balloon in the car. | 3
I would like to be able to set the range on either side of “balloon”. In this case, let’s say one word, the output should be:
red balloon | 2
new balloon | 1
another balloon | 1
balloon in | 1
balloon for | 2
balloon is | 1
It’s quite a lot of code, so I’ll try to explain.
First I used a split function, which splits a varchar by a given varchar delimiter.
Given the varchar ‘red balloon sky’ with a space as the delimiter, it will output:
The Side column shows where the delimiter sits relative to each piece: R means the space is on the right side of the word, L means the space is on the left side of the word, and LR means the word is surrounded by spaces.
When the delimiter is ‘balloon’:
So balloon appears on the right side of red and on the left side of sky.
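The split function itself is not shown here, so the following is only a minimal sketch of a function with the behaviour described above, not the original code. The name dbo.fnSplitWithSide, its parameters, and its Position/Part/Side columns are all hypothetical.

```sql
-- Hypothetical helper, not the answer's original function.
-- Splits @text on @delimiter and reports on which side(s) of each piece
-- the delimiter was found: 'R', 'L', 'LR', or '' (delimiter not found).
CREATE FUNCTION dbo.fnSplitWithSide (@text varchar(8000), @delimiter varchar(100))
RETURNS @parts TABLE (Position int IDENTITY(1, 1), Part varchar(8000), Side varchar(2))
AS
BEGIN
    -- DATALENGTH is used instead of LEN because LEN ignores trailing
    -- spaces, so LEN(' ') would be 0.
    DECLARE @delimiterLength int = DATALENGTH(@delimiter);
    DECLARE @textLength int = DATALENGTH(@text);
    DECLARE @start int = 1;
    DECLARE @hasLeft bit = 0;   -- 1 once a delimiter has been seen to the left
    DECLARE @hit int = CHARINDEX(@delimiter, @text);

    WHILE @hit > 0
    BEGIN
        -- Emit the piece between the previous delimiter and this one,
        -- skipping empty pieces (leading or consecutive delimiters).
        IF @hit > @start
            INSERT INTO @parts (Part, Side)
            VALUES (SUBSTRING(@text, @start, @hit - @start),
                    CASE WHEN @hasLeft = 1 THEN 'LR' ELSE 'R' END);

        SET @start = @hit + @delimiterLength;
        SET @hasLeft = 1;
        SET @hit = CHARINDEX(@delimiter, @text, @start);
    END;

    -- Emit whatever remains after the last delimiter.
    IF @start <= @textLength
        INSERT INTO @parts (Part, Side)
        VALUES (SUBSTRING(@text, @start, @textLength - @start + 1),
                CASE WHEN @hasLeft = 1 THEN 'L' ELSE '' END);

    RETURN;
END;
GO

SELECT Position, Part, Side FROM dbo.fnSplitWithSide('red balloon sky', ' ') ORDER BY Position;
SELECT Position, Part, Side FROM dbo.fnSplitWithSide('red balloon sky', 'balloon') ORDER BY Position;
```

With a space as the delimiter, this sketch returns red (R), balloon (LR) and sky (L). With ‘balloon’ as the delimiter, it returns ‘red ’ (R) and ‘ sky’ (L), with the surrounding spaces kept in the pieces.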
With this helper function in place, I created another function which outputs the required format for a single sentence (varchar).
This function uses the previous one. First it splits the sentence by the word, then it counts the words in each piece by doing another split by space. When the word appears on both sides of a piece, the piece is emitted twice (that is the join with the values 1 and 2).
This function also outputs each piece concatenated with the word, placed according to which side the word is on: left, right or both. It also outputs the Side, which this time is Left or Right.
Now, using this function, I can cross apply it with a table.
The output is:
It also works when there are multiple occurrences.
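The word count described above comes from a second split by space. As a simpler stand-in for that step (a different technique, and assuming words are separated by single spaces), the count can also be sketched by comparing string lengths with and without spaces. The variable names here are illustrative only.

```sql
-- Sketch: count the words in one fragment without a split function.
-- Assumes words are separated by single spaces.
DECLARE @fragment varchar(250) = ' in the car.';

-- Trim first, so a leading or trailing space left over from splitting
-- on the word is not counted as an extra word.
DECLARE @trimmed varchar(250) = LTRIM(RTRIM(@fragment));

SELECT CASE
           WHEN @trimmed = '' THEN 0
           -- number of spaces + 1 = number of words
           ELSE LEN(@trimmed) - LEN(REPLACE(@trimmed, ' ', '')) + 1
       END AS NumberOfWords;   -- 3 for ' in the car.'
```

That matches the NumberOfWords of 3 reported for “balloon in the car.” in the question's sample output.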
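To limit the range to n words on either side of the word, as the update to the question asks, one possible approach is sketched below. It does not use the functions described above. It assumes SQL Server 2022 or Azure SQL Database, because it relies on the enable_ordinal argument of STRING_SPLIT, along with STRING_AGG. The Id column and the @word and @n variables are additions for illustration.

```sql
-- Sketch only: needs STRING_SPLIT with enable_ordinal and STRING_AGG.
DECLARE @table TABLE (Id int IDENTITY(1, 1) PRIMARY KEY, Sentence varchar(250));

INSERT INTO @table (Sentence)
VALUES ('I have another red balloon in the car.'),
       ('Here is a new balloon for you.'),
       ('A red balloon is in the other room.'),
       ('Is there another balloon for me?');

DECLARE @word varchar(50) = 'balloon';
DECLARE @n int = 1;   -- number of words to keep on each side

WITH Tokens AS
(
    -- One row per word, with its position inside the sentence.
    SELECT t.Id, s.ordinal AS Pos, s.value AS Word
    FROM @table t
    CROSS APPLY STRING_SPLIT(t.Sentence, ' ', 1) s
),
Hits AS
(
    -- Every occurrence of the search word.
    SELECT Id, Pos
    FROM Tokens
    WHERE Word = @word
),
Phrases AS
(
    -- Up to @n words before the hit, followed by the hit itself.
    SELECT STRING_AGG(t.Word, ' ') WITHIN GROUP (ORDER BY t.Pos) AS Phrase
    FROM Hits h
    JOIN Tokens t ON t.Id = h.Id AND t.Pos BETWEEN h.Pos - @n AND h.Pos
    GROUP BY h.Id, h.Pos
    HAVING COUNT(*) > 1   -- skip hits with no word on that side

    UNION ALL

    -- The hit itself, followed by up to @n words after it.
    SELECT STRING_AGG(t.Word, ' ') WITHIN GROUP (ORDER BY t.Pos)
    FROM Hits h
    JOIN Tokens t ON t.Id = h.Id AND t.Pos BETWEEN h.Pos AND h.Pos + @n
    GROUP BY h.Id, h.Pos
    HAVING COUNT(*) > 1
)
SELECT Phrase, COUNT(*) AS Occurrences
FROM Phrases
GROUP BY Phrase
ORDER BY Occurrences DESC, Phrase;
```

With @n = 1 and the four sample sentences, this should give the counts asked for in the question: 2 each for “red balloon” and “balloon for”, and 1 for each of the other phrases. Because the words are split on single spaces, a word with punctuation attached (such as “balloon.”) would not match @word, and repeated spaces would produce empty tokens. Real titles may need cleaning first.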