I am now reading the book Data Mining: Practical machine learning tools and techniques


Asked: June 13, 2026 (2026-06-13T22:46:15+00:00)


I am reading the book Data Mining: Practical Machine Learning Tools and Techniques, third edition. Section 4.8, on clustering, discusses how to use k-d trees or ball trees to speed up the k-means algorithm.

After building the ball tree over all the data points, the algorithm searches the leaf nodes to see which pre-chosen cluster center each of the points in them is closest to. The book says that sometimes the region represented by a higher interior node falls entirely within the domain of a single cluster center. In that case we need not traverse its child nodes, and all the data points below it can be processed in one go.

The question is: when implementing the data structure and the algorithm, how can we decide whether the region represented by an interior node falls entirely within the domain of a single cluster center?

In two- or three-dimensional space this is not difficult: we can check whether any of the perpendicular bisectors between pairs of cluster centers crosses the region represented by the interior node.

But how can this be recognized in higher-dimensional spaces? Is there a general method for this?


1 Answer

Editorial Team · Answer 1 · 2026-06-13T22:46:15+00:00

You need to consider maximum and minimum distances.

If the minimum distance from a spatial container (say, a sphere of radius r) to every other mean is larger than its maximum distance to one particular mean, then all objects inside the container belong to that mean. This is because if

maxdist(mean_i, container) < min over all j != i of mindist(mean_j, container)

then, since dist(mean_i, obj) <= maxdist(mean_i, container) and mindist(mean_j, container) <= dist(mean_j, obj) for any object in the container, in particular

dist(mean_i, obj_in_container) < min over all j != i of dist(mean_j, obj_in_container)

I.e., the object will belong to mean i.
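
As a rough illustration, the test above might be sketched as follows for a ball-tree node, assuming Euclidean distance and a node described by a center and a radius. The function names (ball_mindist, ball_maxdist, owning_mean) are made up for this example and do not come from the book or from any library.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ball_mindist(mean, center, radius):
    """Smallest possible distance from `mean` to any point in the ball."""
    return max(0.0, dist(mean, center) - radius)

def ball_maxdist(mean, center, radius):
    """Largest possible distance from `mean` to any point in the ball."""
    return dist(mean, center) + radius

def owning_mean(means, center, radius):
    """Return the index of the mean that owns the whole ball, or None.

    Only the mean with the smallest maxdist can satisfy the condition,
    so it is the single candidate that needs checking.
    """
    best = min(range(len(means)),
               key=lambda i: ball_maxdist(means[i], center, radius))
    bound = ball_maxdist(means[best], center, radius)
    for j, m in enumerate(means):
        if j != best and ball_mindist(m, center, radius) <= bound:
            return None  # some point in the ball might be closer to mean j
    return best
```

If owning_mean returns an index, every point below that node can be assigned to that mean without descending further; if it returns None, the children have to be examined. The test is conservative: it can return None for a node that does in fact lie in a single cluster's domain.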

Minimum and maximum distances for spheres and rectangles are easy to compute in any number of dimensions. However, in higher dimensions mindist and maxdist become quite similar, and the condition will rarely hold. It also makes a huge difference whether your tree is well structured (i.e. small containers) or badly structured (overlapping containers).

k-d trees are nice for in-memory, read-only operations, but they perform quite badly under insertions. R*-trees are a lot better here. The improved split strategy of R*-trees also pays off, because it generates more rectangular boxes than the other strategies.
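
For axis-aligned rectangles (as in k-d trees or R*-trees), one possible way to compute the two bounds is per dimension, as sketched below. This assumes Euclidean distance and a box given by its lower and upper corners; rect_mindist and rect_maxdist are illustrative names, not part of any library.

```python
import math

def rect_mindist(point, lo, hi):
    """Smallest distance from `point` to the box [lo, hi]; 0 if inside."""
    total = 0.0
    for p, l, h in zip(point, lo, hi):
        # Per-dimension gap to the box: zero when p lies within [l, h].
        d = max(l - p, 0.0, p - h)
        total += d * d
    return math.sqrt(total)

def rect_maxdist(point, lo, hi):
    """Largest distance from `point` to any point of the box [lo, hi]."""
    total = 0.0
    for p, l, h in zip(point, lo, hi):
        # The farthest point of the box is always one of its corners.
        d = max(abs(p - l), abs(p - h))
        total += d * d
    return math.sqrt(total)
```

Both functions loop once over the coordinates, so the cost is linear in the number of dimensions regardless of how many there are.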
