Please help me build a word-pair frequency table from a table with 100 million records in a SQL Server 2008 database.
The table looks like this:
Original table
id | source | comment(255)
------------------------------------------
1  | A1     | review budget limitation
source is an ID that can have about 800 different values. The distribution of sources in the original table is exponential: the number of records with source A1 could be 20 million, while A500 might have only 10,000.
In the end, I would like to get a word-pair frequency table that ignores these words:
the, and, of, to, a, i, it, in, or, is
How I expect it to work (this may not be optimal):
- Read the first two words from the comment in the original table and put them into the Frequency table.
- Read the next two words and add them too. Pairs overlap, so after these two steps the table looks like this:
Frequency table
id | word pairs        | source | Frequency
---------------------------------------------
1  | review budget     | A1     | 1
2  | budget limitation | A1     | 1
- Continue until the full comment from the first record (for example, one with source A1) has been processed.
- Start the next record and process it in the same way.
- If the same word pair already exists in the Frequency table with the same source, just increment Frequency; if the source is different, add the pair as a new row with the new source.
Please help me with an optimal SQL script for SQL Server.
I’ll work this out in a minute (given time) but I’d like to put forth two imperatives:
With these three rules you can move tons of data. Once you have built the SELECT statement, it’s just a matter of dumping it into a table.
EDIT:
Then, use that to build a select
It’s possible that I’m off by some edge case, but it should give you an idea of what I’m going for.
It also doesn’t consider “word1 word2” and “word2 word1” to be equal.
I leave that as an exercise to the reader :p
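For anyone who wants something concrete to start from, here is a minimal sketch of the idea: a table-valued function that turns one comment into adjacent word pairs, and a SELECT that aggregates them per source. It is an illustration written under assumptions, not the code originally posted here. The names dbo.fn_WordPairs and dbo.OriginalTable are hypothetical, words are assumed to be separated by single spaces, and ignored words are dropped before pairing, so "review of budget" yields "review budget". Like the approach above, it treats "word1 word2" and "word2 word1" as different pairs.

```sql
-- Hypothetical helper: returns the adjacent word pairs of one comment.
-- Assumes words are separated by spaces; ignored words are removed first.
CREATE FUNCTION dbo.fn_WordPairs (@comment VARCHAR(255))
RETURNS @pairs TABLE (word1 VARCHAR(255), word2 VARCHAR(255))
AS
BEGIN
    -- pos records the order in which the kept words appear
    DECLARE @words TABLE (pos INT IDENTITY(1,1) PRIMARY KEY, word VARCHAR(255));

    -- Trailing space so the last word is terminated like the others.
    -- Assigning in DECLARE needs SQL Server 2008 or later.
    DECLARE @text  VARCHAR(256) = LTRIM(RTRIM(@comment)) + ' ';
    DECLARE @start INT = 1;
    DECLARE @end   INT = CHARINDEX(' ', @text, @start);
    DECLARE @word  VARCHAR(255);

    WHILE @end > 0
    BEGIN
        SET @word = LOWER(SUBSTRING(@text, @start, @end - @start));

        -- Skip empty tokens (double spaces) and the ignored words
        IF @word <> ''
           AND @word NOT IN ('the','and','of','to','a','i','it','in','or','is')
            INSERT INTO @words (word) VALUES (@word);

        SET @start = @end + 1;
        SET @end   = CHARINDEX(' ', @text, @start);
    END;

    -- Pair every kept word with the one that follows it
    INSERT INTO @pairs (word1, word2)
    SELECT w1.word, w2.word
    FROM @words AS w1
    JOIN @words AS w2 ON w2.pos = w1.pos + 1;

    RETURN;
END;
GO

-- Aggregate pairs per source; dbo.OriginalTable is a placeholder name.
SELECT p.word1 + ' ' + p.word2 AS word_pair,
       t.[source],
       COUNT(*) AS frequency
FROM dbo.OriginalTable AS t
CROSS APPLY dbo.fn_WordPairs(t.[comment]) AS p
GROUP BY p.word1, p.word2, t.[source];
```

Aggregating with GROUP BY replaces the row-by-row "increment if it exists" step from the question. The loop inside the function is simple rather than fast; on 100 million rows a set-based split (for example, against a numbers table) would likely perform better, and it may be worth running the SELECT one source at a time.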
EDIT:
Added the TABLE keyword on the RETURNS line. Also, assigning a value in the DECLARE only works starting from SQL 2008, I think.
EDIT:
Added the RETURN statement.
EDIT:
Changes per AntarticIce’s feedback