I encountered a bit of code in the book I’ve been reading that has me questioning the behavior of the SUBSTRING() function. The code is supposed to search a NYSIIS replacement table (a phonetic encoding example) and replace the middle ‘n-gram’ of an input string based on the location (‘End’, ‘Mid’, or ‘Start’) in the table. An excerpt is provided below:
NYSIIS Replacement Table:
Location  NGram  Replacement
Mid       A      A
Mid       AW     AA
Mid       E      A
Mid       EV     AF
Mid       EW     AA
Mid       I      A
USE [AdventureWorks];
DECLARE @Result NVARCHAR(100) = N'NEVADA';
DECLARE @Replacement NVARCHAR(10);
DECLARE @i INT;
SET @i = 1;
WHILE @i <= LEN(@Result)
BEGIN
    SET @Replacement = NULL;
    -- Grab the middle-of-name replacement for the longest n-gram matching at position @i
    SELECT TOP(1) @Replacement = Replacement
    FROM dbo.NYSIIS_Replacements
    WHERE Location = N'Mid'
        AND SUBSTRING(@Result, @i, LEN(NGram)) = NGram
    ORDER BY LEN(NGram) DESC;
    -- If no n-gram matched, fall back to the current character
    SET @Replacement = COALESCE(@Replacement, SUBSTRING(@Result, @i, 1));
    -- Apply the replacement (a no-op when it is the fallback character)
    SET @Result = STUFF(@Result, @i, LEN(@Replacement), @Replacement);
    -- Move on to the next n-gram
    SET @i = @i + COALESCE(LEN(@Replacement), 1);
END;
SELECT @Result;
When the SUBSTRING() function encounters two possible matches, using ‘NEVADA’ as an example (‘E’ and ‘EV’ in the table), how does it ‘know’ to use the two-letter string as opposed to the one-letter string? Is this the expected behavior for SUBSTRING()?
I would have assumed the @Replacement variable would contain both ‘A’ and ‘AF’, but when debugging it appears to contain only ‘N’ in the first iteration and ‘AF’ in the second.
Also, I could not understand why TOP and ORDER BY were included in this example. Commenting them out produces the same results.
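SUBSTRING() does not choose between the matches. The WHERE clause is evaluated once per table row, each time with that row’s LEN(NGram), so at position 2 of ‘NEVADA’ both the ‘E’ row and the ‘EV’ row satisfy the condition. The ORDER BY clause sorts the matching rows by pattern length in descending order, so the longest match comes first. The TOP clause limits the result to that first row. @Replacement is a scalar variable, so it can only ever hold one value, never both ‘A’ and ‘AF’. Removing the ORDER BY clause makes the result unpredictable: without it, the order of the matching rows is not guaranteed, so getting the same output is a coincidence rather than something to rely on. COALESCE is used to set @Replacement to either the replacement pattern or, if no pattern match was found, the character at position @i in the @Result string.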
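
To see the candidate rows for yourself, a query along the following lines may help. It is only a sketch: the table variable @Replacements is a hypothetical stand-in for dbo.NYSIIS_Replacements, loaded with just the six rows from the excerpt, and @i is fixed at 2 (the position of the ‘E’ in ‘NEVADA’) instead of being driven by the loop. The Compared column is added purely for illustration.

```sql
-- Hypothetical stand-in for dbo.NYSIIS_Replacements (excerpted rows only)
DECLARE @Replacements TABLE
(
    Location    NVARCHAR(10),
    NGram       NVARCHAR(10),
    Replacement NVARCHAR(10)
);

INSERT INTO @Replacements (Location, NGram, Replacement)
VALUES (N'Mid', N'A',  N'A'),
       (N'Mid', N'AW', N'AA'),
       (N'Mid', N'E',  N'A'),
       (N'Mid', N'EV', N'AF'),
       (N'Mid', N'EW', N'AA'),
       (N'Mid', N'I',  N'A');

DECLARE @Result NVARCHAR(100) = N'NEVADA';
DECLARE @i INT = 2;  -- position of the 'E' in 'NEVADA'

-- List every pattern that matches at position @i, longest first.
-- SUBSTRING is evaluated per row, using that row's own LEN(NGram).
SELECT NGram,
       Replacement,
       SUBSTRING(@Result, @i, LEN(NGram)) AS Compared
FROM @Replacements
WHERE Location = N'Mid'
  AND SUBSTRING(@Result, @i, LEN(NGram)) = NGram
ORDER BY LEN(NGram) DESC;
```

With the excerpted rows this returns two rows, EV/AF first and E/A second. Adding TOP(1) keeps only the EV row, which is why ‘AF’ shows up in the debugger. If you drop both TOP and ORDER BY from the original assignment, the variable is assigned once per matching row and keeps whichever value happens to be assigned last, so the outcome depends on the order in which the engine reads the rows.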