I have this parent-child relationship:
Paragraph
---------
ParagraphID PK
// other attributes ...
Sentence
--------
SentenceID PK
ParagraphID FK -> Paragraph.ParagraphID
Text nvarchar(4000)
Offset int
Score int
// other attributes ...
I'd like to find paragraphs that are equivalent, that is, paragraphs that contain the same set of sentences. Two sentences are considered the same if they have the same Text, Offset and Score; SentenceID and ParagraphID are not part of the comparison.
Could someone show me what a query to find equal paragraphs would look like?
EDIT: There are ca. 150K paragraphs, and 1.5M sentences. The output should include the ParagraphID, and the lowest paragraph ID that is equivalent to this one. E.g. if paragraph1 and paragraph2 are equal, then output would be
ParagraphID EquivParagraphID
1 1
2 1
In short, you need a signature for each paragraph and then compare the signatures. You did not mention what the output should look like, so here I'm returning one row per distinct paragraph signature, containing a comma-delimited list of the ParagraphID values that share it.
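As a rough, runnable sketch of the signature idea, the following uses Python's built-in `sqlite3` module as a stand-in for SQL Server, so the syntax will differ from T-SQL. The aggregate `paragraph_sig`, the class `ParagraphSignature`, and the helper `open_demo_db` are hypothetical names made up for this illustration. It assumes Text, Offset and Score are never NULL, and it treats a paragraph as a set, so a repeated sentence counts once.

```python
import hashlib
import sqlite3


class ParagraphSignature:
    """Hypothetical SQLite aggregate: order-independent digest of a paragraph."""

    def __init__(self):
        self.rows = set()  # set semantics: duplicate sentences count once

    def step(self, text, offset, score):
        self.rows.add((text, offset, score))

    def finalize(self):
        # Sorting makes the digest independent of the order rows arrive in.
        canonical = repr(sorted(self.rows)).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()


def open_demo_db(sentences):
    """Create an in-memory Sentence table from (ParagraphID, Text, Offset, Score) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute('''
        CREATE TABLE Sentence (
            SentenceID  INTEGER PRIMARY KEY,
            ParagraphID INTEGER NOT NULL,
            Text        TEXT    NOT NULL,
            "Offset"    INTEGER NOT NULL,
            Score       INTEGER NOT NULL
        )''')
    conn.executemany(
        'INSERT INTO Sentence (ParagraphID, Text, "Offset", Score) VALUES (?, ?, ?, ?)',
        sentences)
    return conn


def equivalent_groups(conn):
    """Return one sorted list of ParagraphIDs per distinct signature."""
    conn.create_aggregate("paragraph_sig", 3, ParagraphSignature)
    rows = conn.execute('''
        SELECT group_concat(ParagraphID) AS ParagraphIDs
        FROM (SELECT ParagraphID,
                     paragraph_sig(Text, "Offset", Score) AS Sig
              FROM Sentence
              GROUP BY ParagraphID) AS s
        GROUP BY Sig
    ''').fetchall()
    # group_concat does not promise an order, so sort on the Python side.
    return sorted(sorted(int(p) for p in ids.split(",")) for (ids,) in rows)


if __name__ == "__main__":
    demo = [(1, "a", 0, 1), (1, "b", 1, 2),
            (2, "b", 1, 2), (2, "a", 0, 1),   # same sentences, different row order
            (3, "a", 0, 1)]
    print(equivalent_groups(open_demo_db(demo)))
```

Paragraphs with no sentences never appear in `Sentence`, so they are not reported by this sketch.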
Given your addition about the desired output, you can change the query like so:
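One way the pairwise form could look is sketched below, again using Python's `sqlite3` as a stand-in for SQL Server. The signatures are materialized once into a temp table and then self-joined. `paragraph_sig`, `ParagraphSignature`, `ParagraphSig` and `open_demo_db` are hypothetical names, and non-NULL columns are assumed.

```python
import hashlib
import sqlite3


class ParagraphSignature:
    """Hypothetical SQLite aggregate: order-independent digest of a paragraph."""

    def __init__(self):
        self.rows = set()

    def step(self, text, offset, score):
        self.rows.add((text, offset, score))

    def finalize(self):
        return hashlib.sha256(repr(sorted(self.rows)).encode("utf-8")).hexdigest()


def open_demo_db(sentences):
    """Create an in-memory Sentence table from (ParagraphID, Text, Offset, Score) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute('''
        CREATE TABLE Sentence (
            SentenceID  INTEGER PRIMARY KEY,
            ParagraphID INTEGER NOT NULL,
            Text        TEXT    NOT NULL,
            "Offset"    INTEGER NOT NULL,
            Score       INTEGER NOT NULL
        )''')
    conn.executemany(
        'INSERT INTO Sentence (ParagraphID, Text, "Offset", Score) VALUES (?, ?, ?, ?)',
        sentences)
    return conn


def equivalent_pairs(conn):
    """Return (ParagraphID, EquivParagraphID) for every ordered pair of matches."""
    conn.create_aggregate("paragraph_sig", 3, ParagraphSignature)
    conn.execute("DROP TABLE IF EXISTS ParagraphSig")
    # Compute each signature once, then join the small table to itself.
    conn.execute('''
        CREATE TEMP TABLE ParagraphSig AS
        SELECT ParagraphID, paragraph_sig(Text, "Offset", Score) AS Sig
        FROM Sentence
        GROUP BY ParagraphID''')
    return conn.execute('''
        SELECT a.ParagraphID, b.ParagraphID AS EquivParagraphID
        FROM ParagraphSig AS a
        JOIN ParagraphSig AS b
          ON b.Sig = a.Sig AND b.ParagraphID <> a.ParagraphID
        ORDER BY a.ParagraphID, b.ParagraphID
    ''').fetchall()


if __name__ == "__main__":
    demo = [(1, "a", 0, 1), (2, "a", 0, 1), (3, "a", 0, 1), (4, "z", 5, 5)]
    for pair in equivalent_pairs(open_demo_db(demo)):
        print(pair)
```

Paragraphs with no match produce no rows in this form.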
Obviously, three or four paragraphs might share the same signature, so be warned that the above results will contain every ordered pair of matching paragraphs, e.g. (P1,P2), (P1,P3), (P2,P1), (P2,P3), (P3,P1), (P3,P2).
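If one row per paragraph with the lowest equivalent ParagraphID is what's wanted, as in the question's sample output, the pairs can be collapsed by joining each signature to its minimum ParagraphID. This is only a sketch under the same assumptions as before: Python's `sqlite3` standing in for SQL Server, non-NULL columns, and hypothetical names (`paragraph_sig`, `ParagraphSignature`, `ParagraphSig`, `open_demo_db`).

```python
import hashlib
import sqlite3


class ParagraphSignature:
    """Hypothetical SQLite aggregate: order-independent digest of a paragraph."""

    def __init__(self):
        self.rows = set()

    def step(self, text, offset, score):
        self.rows.add((text, offset, score))

    def finalize(self):
        return hashlib.sha256(repr(sorted(self.rows)).encode("utf-8")).hexdigest()


def open_demo_db(sentences):
    """Create an in-memory Sentence table from (ParagraphID, Text, Offset, Score) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute('''
        CREATE TABLE Sentence (
            SentenceID  INTEGER PRIMARY KEY,
            ParagraphID INTEGER NOT NULL,
            Text        TEXT    NOT NULL,
            "Offset"    INTEGER NOT NULL,
            Score       INTEGER NOT NULL
        )''')
    conn.executemany(
        'INSERT INTO Sentence (ParagraphID, Text, "Offset", Score) VALUES (?, ?, ?, ?)',
        sentences)
    return conn


def lowest_equivalent(conn):
    """Return (ParagraphID, EquivParagraphID), one row per paragraph."""
    conn.create_aggregate("paragraph_sig", 3, ParagraphSignature)
    conn.execute("DROP TABLE IF EXISTS ParagraphSig")
    conn.execute('''
        CREATE TEMP TABLE ParagraphSig AS
        SELECT ParagraphID, paragraph_sig(Text, "Offset", Score) AS Sig
        FROM Sentence
        GROUP BY ParagraphID''')
    # A paragraph with no twin maps to itself, since it is the minimum of its own group.
    return conn.execute('''
        SELECT s.ParagraphID, m.EquivParagraphID
        FROM ParagraphSig AS s
        JOIN (SELECT Sig, MIN(ParagraphID) AS EquivParagraphID
              FROM ParagraphSig
              GROUP BY Sig) AS m
          ON m.Sig = s.Sig
        ORDER BY s.ParagraphID
    ''').fetchall()


if __name__ == "__main__":
    demo = [(1, "a", 0, 1), (1, "b", 1, 2),
            (2, "b", 1, 2), (2, "a", 0, 1),
            (3, "c", 0, 1)]
    for row in lowest_equivalent(open_demo_db(demo)):
        print(row)
```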
In the comments you asked about leaving the comparison on the sentence text until last. Since you have two other columns, you could reduce the number of signatures generated by comparing on the two int columns first:
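A possible shape for that pre-filter is sketched below: a cheap per-paragraph fingerprint (sentence count plus the sums of Offset and Score) is computed first, and the full signature is built only for paragraphs whose fingerprint matches at least one other paragraph. The fingerprint choice is an assumption of this sketch, not something stated in the thread, and it additionally assumes a paragraph never contains the same (Text, Offset, Score) sentence twice. As before, Python's `sqlite3` stands in for SQL Server and `paragraph_sig`, `ParagraphSignature`, `Cheap` and `open_demo_db` are hypothetical names.

```python
import hashlib
import sqlite3
from collections import defaultdict


class ParagraphSignature:
    """Hypothetical SQLite aggregate: order-independent digest of a paragraph."""

    def __init__(self):
        self.rows = set()

    def step(self, text, offset, score):
        self.rows.add((text, offset, score))

    def finalize(self):
        return hashlib.sha256(repr(sorted(self.rows)).encode("utf-8")).hexdigest()


def open_demo_db(sentences):
    """Create an in-memory Sentence table from (ParagraphID, Text, Offset, Score) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute('''
        CREATE TABLE Sentence (
            SentenceID  INTEGER PRIMARY KEY,
            ParagraphID INTEGER NOT NULL,
            Text        TEXT    NOT NULL,
            "Offset"    INTEGER NOT NULL,
            Score       INTEGER NOT NULL
        )''')
    conn.executemany(
        'INSERT INTO Sentence (ParagraphID, Text, "Offset", Score) VALUES (?, ?, ?, ?)',
        sentences)
    return conn


def prefiltered_groups(conn):
    """Return (candidate ParagraphIDs, groups of equivalent ParagraphIDs)."""
    conn.create_aggregate("paragraph_sig", 3, ParagraphSignature)
    conn.execute("DROP TABLE IF EXISTS Cheap")
    # Step 1: int-only fingerprint; Text is not read here.
    conn.execute('''
        CREATE TEMP TABLE Cheap AS
        SELECT ParagraphID, COUNT(*) AS N,
               SUM("Offset") AS SumOffset, SUM(Score) AS SumScore
        FROM Sentence
        GROUP BY ParagraphID''')
    # Step 2: keep only paragraphs whose fingerprint is shared with another one.
    candidate_sql = '''
        SELECT c.ParagraphID
        FROM Cheap AS c
        WHERE EXISTS (SELECT 1 FROM Cheap AS d
                      WHERE d.ParagraphID <> c.ParagraphID
                        AND d.N = c.N
                        AND d.SumOffset = c.SumOffset
                        AND d.SumScore = c.SumScore)'''
    candidates = sorted(r[0] for r in conn.execute(candidate_sql))
    # Step 3: full signature (including Text) for the candidates only.
    by_sig = defaultdict(list)
    for pid, sig in conn.execute(f'''
            SELECT s.ParagraphID, paragraph_sig(s.Text, s."Offset", s.Score) AS Sig
            FROM Sentence AS s
            WHERE s.ParagraphID IN ({candidate_sql})
            GROUP BY s.ParagraphID'''):
        by_sig[sig].append(pid)
    groups = sorted(sorted(ids) for ids in by_sig.values() if len(ids) > 1)
    return candidates, groups


if __name__ == "__main__":
    demo = [(1, "a", 0, 1), (1, "b", 1, 2),
            (2, "b", 1, 2), (2, "a", 0, 1),
            (3, "a", 0, 1), (3, "c", 1, 2),   # same ints as 1 and 2, different text
            (4, "a", 0, 9)]                   # different ints, filtered out early
    print(prefiltered_groups(open_demo_db(demo)))
```

Whether this saves anything depends on how selective the int columns are in the real data; if most paragraphs share the same offsets and scores, nearly every paragraph stays a candidate.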