I have a table with 1,000,000+ records and I would like to find the most common substring that is at least 5 characters long.
If I have the following entries:
KDHFOUDHGOENWFIJ 1114H4363SDFHDHGFDG
GSDLGJSLJSKJDFSG 1114H20SDGDSSFHGSLD
SLSJDHLJKSSDJFKD 1114HJSDHFJKSDKFSGG
I would like to write a SQL statement that selects 1114H as the most common substring. How can I do this?
Notes:
- The substring does not have to be in the same location.
- The substrings must be 5 characters long
- The maximum length of each record is 50 characters
There is no requirement to find the longest substring, so every substring longer than 5 characters will always contain a 5-character substring that occurs at least as often. So we only have to check substrings of length 5.
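The idea above can be sketched in T-SQL roughly as follows. This is a minimal sketch, not necessarily the exact query of this answer: the table name dbo.Records and the column name Value are hypothetical, the column is assumed to be varchar(50), and master..spt_values (an undocumented system table) stands in for a numbers table.

```sql
-- Hypothetical schema, adjust to the real table:
-- CREATE TABLE dbo.Records (Value varchar(50) NOT NULL);

WITH Numbers AS (
    -- Start positions 1..46: a 50-character value has at most
    -- 46 substrings of length 5.
    SELECT number AS n
    FROM master..spt_values
    WHERE type = 'P'
      AND number BETWEEN 1 AND 46
)
SELECT TOP (1) WITH TIES
       SUBSTRING(r.Value, n.n, 5) AS Sub,
       COUNT(*)                   AS Occurrences
FROM dbo.Records AS r
JOIN Numbers AS n
  -- Only start positions that leave room for 5 full characters.
  -- LEN ignores trailing spaces, so trailing blanks are never counted.
  ON n.n <= LEN(r.Value) - 4
GROUP BY SUBSTRING(r.Value, n.n, 5)
ORDER BY COUNT(*) DESC;
```

TOP (1) WITH TIES returns every 5-character substring that shares the highest count. COUNT(*) counts occurrences, so a substring that appears twice in one record is counted twice; counting distinct records instead would need a key column, which the question does not show. Whether upper and lower case are treated as the same substring depends on the column's collation.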
In the sample data there are three strings that occur three times: _1114H, _1114 and 1114H (_ is to show the location of a space).
In this solution, master..spt_values is used in place of a numbers table.
Result: