Please help me to build word pairs frequency table from table with 100 mln

Question

Asked: June 4, 2026 (2026-06-04T22:26:49+00:00)

Please help me build a word-pair frequency table from a table with 100 million records in a SQL Server 2008 database.
The table looks like this:

Original table
id | source | comment(255)
---------------------------------------
1  | A1     | review budget limitation

source is an ID that can take about 800 different values. The distribution of sources in the original table is exponential: there could be 20 million records with source A1 and only 10,000 with source A500.

In the end I would like to get a word-pair frequency table that ignores these words:
the, and, of, to, a, i, it, in, or, is

How I expect it to work (this may not be optimal):

Read the first two words from the comment in the original table and put them into the frequency table.
Read the next two words and do the same.

Frequency table

id | word pairs        | source | Frequency
---------------------------------------------
1  | review budget     | A1     | 1
2  | budget limitation | A1     | 1

Process the full comment of the first record, which has, for example, source A1.
Move on to the next record and process it in the same way.
If a word pair already exists in the frequency table with the same source, just increment Frequency; if the source is different, add the pair with the new source.

Please help me with an optimal SQL script for SQL Server.

1 Answer

Editorial Team · Answer 1 · 2026-06-04T22:26:51+00:00

I’ll work this out in a minute (given time), but I’d like to put forth three imperatives:

Anything that needs to be done fast in SQL needs to be done set-based. Avoid processing things “one at a time”.
Use a table-valued function to split the comments into a table of word pairs.
Use Common Table Expressions to layer your work and keep things readable.

With these three rules you can move tons of data. Once you have built the SELECT statement, it’s just a matter of dumping the result into a table.

EDIT:

CREATE FUNCTION dbo.SplitToPairs(@sText nvarchar(255))
RETURNS @Pairs TABLE (
    Pair nvarchar(255) NOT NULL
)
AS
BEGIN
    -- Work on a copy with a trailing space so the last word also ends at a delimiter.
    DECLARE @Text nvarchar(256) = LTRIM(RTRIM(@sText)) + N' ';
    DECLARE @Start int = 1;                       -- start of the current word
    DECLARE @End int = CHARINDEX(N' ', @Text);    -- space that ends the current word
    DECLARE @Prev nvarchar(255);                  -- previous kept word (NULL until one is found)
    DECLARE @Word nvarchar(255);

    WHILE @End > 0
    BEGIN
        SET @Word = SUBSTRING(@Text, @Start, @End - @Start);

        -- Skip empty tokens (repeated spaces) and ignored words.
        -- Whether the match is case-sensitive depends on the collation.
        IF @Word <> N''
           AND CHARINDEX(N'|' + @Word + N'|', N'|the|and|of|to|a|i|it|in|or|is|') = 0
        BEGIN
            -- Pair the current word with the previous kept word, if there is one.
            IF @Prev IS NOT NULL
                INSERT @Pairs (Pair) VALUES (@Prev + N' ' + @Word);
            SET @Prev = @Word;
        END

        -- Advance to the next word.
        SET @Start = @End + 1;
        SET @End = CHARINDEX(N' ', @Text, @Start);
    END
    -- Note: if fewer than two non-ignored words are found, no insert happens.
    RETURN;
END
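
A quick way to sanity-check the function is to call it on the sample comment from the question. Going by the question's frequency table, the intended pairs for that comment are "review budget" and "budget limitation". This is only a spot check, not a test of every edge case:

```sql
-- Spot check: split the sample comment from the question into word pairs.
SELECT Pair
FROM dbo.SplitToPairs(N'review budget limitation');
```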

Then, use that to build a SELECT:

SELECT I.Source, P.Pair, COUNT(*) AS Frequency
FROM Information AS I CROSS APPLY dbo.SplitToPairs(I.Comment) AS P
GROUP BY I.Source, P.Pair
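
Dumping that result into a table could look roughly like the sketch below, which layers the split in a Common Table Expression and inserts the grouped counts in one set-based statement. The dbo.Frequency table, its column names, and the varchar(20) type for Source are assumptions based on the layout in the question, not something the question specifies:

```sql
-- Hypothetical target table; names follow the sketch in the question.
CREATE TABLE dbo.Frequency (
    id        int IDENTITY(1,1) PRIMARY KEY,
    WordPair  nvarchar(255) NOT NULL,
    Source    varchar(20)   NOT NULL,   -- assumed type for the source ID
    Frequency int           NOT NULL
);

-- Layer 1: one row per word pair per comment.
WITH Pairs AS (
    SELECT I.Source, P.Pair
    FROM Information AS I
    CROSS APPLY dbo.SplitToPairs(I.Comment) AS P
)
-- Layer 2: count each pair per source and store the result.
INSERT INTO dbo.Frequency (WordPair, Source, Frequency)
SELECT Pair, Source, COUNT(*)
FROM Pairs
GROUP BY Pair, Source;
```

Because the grouping happens in a single statement, there is no need to look up and increment existing rows one at a time.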

It’s possible that I’m off by some edge case, but it should give you an idea of what I’m going for.
It also doesn’t consider “word1 word2” and “word2 word1” to be equal.
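
One possible way to treat the two orders as equal, assuming every value returned by dbo.SplitToPairs contains exactly one space, is to split each pair back into its two words and always put the smaller word first before grouping. The CTE names Pairs and Ordered are only illustrative:

```sql
-- Split each pair back into its two words (assumes exactly one space per pair).
WITH Pairs AS (
    SELECT I.Source,
           LEFT(P.Pair, CHARINDEX(N' ', P.Pair) - 1)           AS Word1,
           SUBSTRING(P.Pair, CHARINDEX(N' ', P.Pair) + 1, 255) AS Word2
    FROM Information AS I
    CROSS APPLY dbo.SplitToPairs(I.Comment) AS P
),
-- Rebuild the pair with the words in a fixed (sorted) order.
Ordered AS (
    SELECT Source,
           CASE WHEN Word1 <= Word2 THEN Word1 + N' ' + Word2
                ELSE Word2 + N' ' + Word1
           END AS Pair
    FROM Pairs
)
SELECT Source, Pair, COUNT(*) AS Frequency
FROM Ordered
GROUP BY Source, Pair;
```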

I leave that as an exercise to the reader :p

EDIT:

Added TABLE keyword on RETURNS line.

Also, assigning a value in the DECLARE statement only works starting with SQL Server 2008, I think.
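
For an older server, the declaration and the assignment would have to be split into two statements. A minimal illustration, using variable names from the function above:

```sql
-- SQL Server 2008 and later: declare and initialize in one statement.
DECLARE @Pos1 int = 0;

-- Earlier versions: declare first, then assign.
DECLARE @Pos2 int;
SET @Pos2 = 0;
```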

EDIT:

Added RETURN statement

EDIT:

Changes per AntarticIce’s feedback
