(Preface: I’m neither a statistician nor a programmer. I work in the humanities, so have mercy on my soul).
I need to calculate the Euclidean distance between a series of points in R. I’ve been using dist(), as follows:
> x <- c(0,0)
> y <- c(0,10)
> dist(rbind(x,y))
   x
y 10
So far, so good. But when I was looking at my results (with real numbers), they were horribly off. So much so that I figured my R script was grabbing data from the wrong columns. But I checked, and it isn’t.
So I started playing around with toy numbers, and I was in for a surprise. The above example (a vertical line) works correctly, as does the following (a horizontal line):
> x <- c(0,10)
> y <- c(0,0)
> dist(rbind(x,y))
   x
y 10
But when the line the two points form is diagonal, strangeness ensues:
> x <- c(0,10)
> y <- c(0,10)
> dist(rbind(x,y))
  x
y 0
A distance of 0? Huh? That can’t be right.
And when the points are identical (that’s quite possible in my data), we go down the rabbit hole:
> x <- c(0,0)
> y <- c(10,10)
> dist(rbind(x,y))
         x
y 14.14214
Should this not be 0? The points are identical, after all, so there can be no distance between them.
Just in case there’s something wrong with dist(), I tried to implement the formula manually, going by Wikipedia. Same results:
> sqrt(sum((x - y) ^ 2))
[1] 14.14214
As I said above, my math background is minimal, so I fully expect that the error here is mine. If so, please explain what it is and how to correct it. But from where I stand right now, it seems like something is very wrong.
And worst of all, I can’t analyze my data.
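It looks like you want dist(cbind(x, y)), not dist(rbind(x, y)).
dist() computes distances between the rows of a matrix. rbind(x, y) makes x one row and y the other, so you get the distance between the vector x and the vector y. If x holds the x-coordinates and y holds the y-coordinates, cbind(x, y) puts one point on each row, which is what you are after.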
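To see the difference, here is a small sketch using the "diagonal" example from the question. It assumes x holds the x-coordinates and y the y-coordinates of the two points; the names by_row, by_col, d_row and d_col are made up for illustration.

```r
# x holds the x-coordinates of the two points, y holds their y-coordinates,
# so the points are (0, 0) and (10, 10).
x <- c(0, 10)
y <- c(0, 10)

by_row <- rbind(x, y)  # rows are the vectors x and y themselves
by_col <- cbind(x, y)  # rows are the points (0, 0) and (10, 10)

# dist() measures distances between rows; as.numeric() drops the
# "dist" formatting and leaves the plain number.
d_row <- as.numeric(dist(by_row))  # x and y are identical vectors
d_col <- as.numeric(dist(by_col))  # diagonal of a 10 x 10 square

# The manual formula, applied to the two points rather than to x and y.
d_manual <- sqrt(sum((by_col[1, ] - by_col[2, ])^2))

print(d_row)     # 0
print(d_col)     # sqrt(200), about 14.14214
print(d_manual)  # same value as d_col
```

The same reasoning probably explains the manual attempt in the question: sqrt(sum((x - y)^2)) subtracts the x-coordinates from the y-coordinates, rather than one point from the other.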