Hi I have been trying to implement the DBSCAN algorithm for Neo4j, but am

Question

0

Asked: June 12, 20262026-06-12T05:15:21+00:00 2026-06-12T05:15:21+00:00

Hi I have been trying to implement the DBSCAN algorithm for Neo4j, but am

0

Hi I have been trying to implement the DBSCAN algorithm for Neo4j, but am running into serious performance bottlenecks. I’ll describe the implementation then ask for help.

I discretized the possible epsilon values and put counts of the number of neighbors under each discretization in each node in order to be able to retrieve all of the core nodes in one query.

START a = node(*)
WHERE a.rel<cutoff threshold>! >= {minp}
RETURN a

This part is fast, the part that isn’t fast is the follow up query :

START a = node({i})
SET a.label<cutoff threshold>_<minpoints> = {clust}
WITH a
MATCH a -[:'|'.join(<valid distance relations>)]- (x)
WHERE not(has(x.label<cutoff threshold>_<minpoints>))
WITH x
SET x.label<cutoff threshold>_<minpoints>={clust}
RETURN x

I then pick a core node to start from, and as long as there are still core node neighbors, run the above query to label their neighbors.

I think the problem is that my graph has very different levels of sparsity – starting from only weak similarity it is almost fully connected, with ~50M relations between ~10k nodes, whereas at strong similarity there are as few as ~20k relations between ~10k nodes (or fewer). No matter what, it is always REALLY slow. What is the best way for me to handle this? Is it to index on relationship type and starting node? I haven’t been able to find any resources on this problem, and surprisingly there isn’t already an implementation since this is a pretty standard graph algorithm. I could use scikit.learn but then I would be restricted to in-memory distance matricies only 🙁

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-12T05:15:23+00:00

What version of neo4j did you try this with?

Up until 1.8 performance has been no design goal of cypher (rather the language)
Have a look at a recent snapshot (1.9-SNAP).

Also make sure that your hot dataset is not just loaded from disk (otherwise you measure disk-io) so your memory mapped settings and also JVM heap is large enough.

You might also want to check out the GCR cache from Neo4j enterprise which has a smaller memory footprint.

What is the cardinality of count(x) in your query? If it is too small you have too many small transactions going on. Depending if your run python embedded or REST use a larger tx-scope or REST-batch-operations

You’re already using parameters which is great. What is the variability of your rel-types ?

Any chance to share your dataset/generator and the code with us (Neo4j) for performance testing on our side?

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Hi I have been trying to implement the DBSCAN algorithm for Neo4j, but am

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply