Trying not to select same connection multiple times I have tried to use charindex() = 0 condition the following way:
WITH Cluster(calling_party, called_party, link_strength, Path)
AS
(SELECT
calling_party,
called_party,
link_strength,
CONVERT(varchar(max), calling_party + '.' + called_party) AS Path
FROM
monthly_connections_test
WHERE
link_strength > 0.1 AND
calling_party = 'b'
UNION ALL
SELECT
mc.calling_party,
mc.called_party,
mc.link_strength,
CONVERT(varchar(max), cl.Path + '.' + mc.calling_party + '.' + mc.called_party) AS Path
FROM
monthly_connections_test mc
INNER JOIN Cluster cl ON
(
mc.called_party = cl.called_party OR
mc.called_party = cl.calling_party
) AND
(
CHARINDEX(cl.called_party + '.' + mc.calling_party, Path) = 0 AND
CHARINDEX(cl.called_party + '.' + mc.called_party, Path) = 0
)
WHERE
mc.link_strength > 0.1
)
SELECT
calling_party,
called_party,
link_strength,
Path
FROM
Cluster OPTION (maxrecursion 30000)
The condition however does not fulfill its purpose as the same rows are selected multiple times.
The actual aim here is to retrieve the whole cluster of connections to which the selected user (in the example user b) belongs.
Edit1:
I tried to modify the query the following way:
With combined_users AS
(SELECT calling_party CALLING, called_party CALLED, link_strength FROM dbo.monthly_connections_test WHERE link_strength > 0.1),
related_users1 AS
(
SELECT c.CALLING, c.CALLED, c.link_strength, CONVERT(varchar(max), '.' + c.CALLING + '.' + c.CALLED + '.') path from combined_users c where CALLING = 'a1'
UNION ALL
SELECT c.CALLING, c.CALLED, c.link_strength,
convert(varchar(max),r.path + c.CALLED + '.') path
from combined_users c
join related_users1 r
ON (c.CALLING = r.CALLED) and CHARINDEX(c.CALLING + '.' + c.CALLED + '.', r.path)= 0
),
related_users2 AS
(
SELECT c.CALLING, c.CALLED, c.link_strength, CONVERT(varchar(max), '.' + c.CALLING + '.' + c.CALLED + '.') path from combined_users c where CALLED = 'a1'
UNION ALL
SELECT c.CALLING, c.CALLED, c.link_strength,
convert(varchar(max),r.path + c.CALLING + '.') path
from combined_users c
join related_users2 r
ON c.CALLED = r.CALLING and CHARINDEX('.' + c.CALLING + '.' + c.CALLED, r.path)= 0
)
SELECT CALLING, CALLED, link_strength, path FROM
(SELECT CALLING, CALLED, link_strength, path FROM related_users1 UNION SELECT CALLING, CALLED, link_strength, path FROM related_users2) r OPTION (MAXRECURSION 30000)
To test the query I created the cluster below:

The query above returned the following table:
a1 a2 1.0000000 .a1.a2.
a11 a13 1.0000000 .a12.a1.a13.a11.
a12 a1 1.0000000 .a12.a1.
a13 a12 1.0000000 .a12.a1.a13.
a14 a13 1.0000000 .a12.a1.a13.a14.
a15 a14 1.0000000 .a12.a1.a13.a14.a15.
a2 a10 1.0000000 .a1.a2.a10.
a2 a3 1.0000000 .a1.a2.a3.
a3 a4 1.0000000 .a1.a2.a3.a4.
a3 a6 1.0000000 .a1.a2.a3.a6.
a4 a8 1.0000000 .a1.a2.a3.a4.a8.
a4 a9 1.0000000 .a1.a2.a3.a4.a9.
The query obviously returns connections toward the selected node and the connections in the opposite direction. The problem is the change of direction: for example the connections a7, a4 and a11, a10 are not selected because the direction is changed (relative to starting node).
Does anyone know how to modify the query in order to include all connections?
Thank you
Okay, there are several things to discuss here.
Zerothly, i have PostgreSQL, so all this is done with that; i’m trying to only use standard SQL, so this should work okay with SQL Server too.
Firstly, if you’re only interested in calls with link strength greater than 0.1, let’s say:
And from now on, we’ll talk in terms of this table.
Secondly, you say:
If that’s true, why are you computing the path? If you just want to know the set of connections, you can do:
Thirdly, perhaps you don’t really want to find what connections are in the cluster, you want to find which people are in the cluster, and what the shortest path to them from the target is. In that case, you can do:
As you can see, you were very much on the right track.
If you actually do want a list of all the calls in the cluster, with paths, then, er, i’ll get back to you!
EDIT: Bear in mind that the queries which don’t construct a path will have very different performance characteristics to those which do. Roughly speaking, the non-path queries will do O(n) work (probably in about O(log n) iteration steps), because they visit each node in the cluster, but the path-constructing steps will do much more – O(n!) maybe? – because they have to visit every path through the graph. You’ll be alright if clusters are as big as in your examples, but if they’re much bigger, you may find the runtime is prohibitive.