I’ve got a large input matrix (4000×10000). I use dist() to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the “same” matrix with an additional row (for a 4001×10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?
I’ll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.
First of all, for Euclidean distance of matrices, I’d recommend the `rdist` function from the `fields` package. It is written in Fortran and is a lot faster than the `dist` function. It returns a `matrix` instead of a `dist` object, but you can always go from one to the other using `as.matrix` and `as.dist`.

Here is (smaller than yours) sample data
and the distance matrix you already computed:
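The original snippet is not reproduced here, so this is a minimal sketch of that setup. The names `X` and `D.old` and the toy dimensions are my own illustrative choices, and it uses base R's `dist` so it runs without extra packages; with `fields` loaded, `rdist(X)` should be a drop-in for the last line.

```r
set.seed(42)                          # reproducible toy data
n <- 6; p <- 4                        # tiny stand-ins for 4000 and 10000
X <- matrix(rnorm(n * p), nrow = n)   # original points, one per row

# Distance matrix for the original points, computed once and kept around.
# as.matrix() turns the compact "dist" object into a full n x n matrix.
D.old <- as.matrix(dist(X))
```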
For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution is general to any number of extra points:
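A rough sketch of those two pieces in base R, on toy data. `cross.dist` is a hypothetical helper of mine, not a function from any package; it uses the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b so that the new-versus-original block comes from one matrix product. With `fields` loaded, `rdist(extra, X)` should give the same block directly.

```r
set.seed(42)
X     <- matrix(rnorm(6 * 4), nrow = 6)   # original points (toy size)
extra <- matrix(rnorm(2 * 4), nrow = 2)   # two hypothetical new points

# Hypothetical helper: Euclidean distances between rows of A and rows of B.
cross.dist <- function(A, B) {
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B)
  sqrt(pmax(sq, 0))                       # clamp tiny negative round-off to 0
}

D.cross <- cross.dist(extra, X)           # 2 x 6: new points vs. original points
D.new   <- as.matrix(dist(extra))         # 2 x 2: new points among themselves
```

The cost is proportional to (number of new points) × (number of old points) × (number of columns), which is a small fraction of a full recomputation when only one or two rows are added. One caveat: the squared-norm identity can lose some precision for points that are nearly identical.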
so you can bind them to your bigger distance matrix:
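As a sketch of the binding step (variable names are illustrative, toy data again): the old matrix goes in the top-left block, the new-versus-new distances in the bottom-right, and the cross block fills the two off-diagonal corners, transposed on one side.

```r
set.seed(42)
X     <- matrix(rnorm(6 * 4), nrow = 6)   # original points
extra <- matrix(rnorm(2 * 4), nrow = 2)   # hypothetical new points

D.old <- unname(as.matrix(dist(X)))       # 6 x 6, already computed in practice
D.new <- unname(as.matrix(dist(extra)))   # 2 x 2

# Cross block, one new point at a time: distances from point e to every row of X.
D.cross <- t(apply(extra, 1, function(e) sqrt(colSums((t(X) - e)^2))))  # 2 x 6

# Block layout:  [ D.old    t(D.cross) ]
#                [ D.cross  D.new      ]
D.full <- rbind(cbind(D.old,   t(D.cross)),
                cbind(D.cross, D.new))
```

If you need a `dist` object afterwards, `as.dist(D.full)` converts it back.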
Let’s check that it matches what a full, long rerun would have produced:
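One way to run that check, sketched on toy data. `append.points` is a hypothetical wrapper of mine around the steps above, and `all.equal` is used rather than `identical` because the two routes may differ by floating-point round-off.

```r
set.seed(42)
X     <- matrix(rnorm(6 * 4), nrow = 6)   # original points
extra <- matrix(rnorm(2 * 4), nrow = 2)   # hypothetical new points

# Hypothetical helper: grow an existing distance matrix by the rows of `extra`.
append.points <- function(D.old, X, extra) {
  D.cross <- t(apply(extra, 1, function(e) sqrt(colSums((t(X) - e)^2))))
  D.new   <- as.matrix(dist(extra))
  unname(rbind(cbind(D.old, t(D.cross)), cbind(D.cross, D.new)))
}

D.incremental <- append.points(as.matrix(dist(X)), X, extra)
D.rerun       <- unname(as.matrix(dist(rbind(X, extra))))  # the full, slow route

# Compare the two results within numerical tolerance.
check <- all.equal(D.incremental, D.rerun)
```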