First, without the details:
I have data.frames like this one:
      val1 val2 val3 val4 val5
    1  1.1    2  1.1  2.1  4.2
    2  5.7    5  5.6  4.9  9.9
    3  3.1    3  3.2  2.9  5.9
    4  9.6    1  9.5  1.0  2.0
and want to find the (nearly) equal columns. The desired result would be something like
[1] "val1" "val2" "val5"
because the column val3 is almost equal to val1, val4 is almost equal to val2 and val5 is different.
Details:
- What does “nearly” equal mean? (Just one of the options listed below.)
  - the absolute difference of the values is smaller than a fixed number (0.2 for the sample above)
  - the relative difference of the values is smaller than a fixed number (~11% for the sample)
  - other metrics which make sense 😉
- A listing of linearly dependent columns would be even better (but I think that’s way more complicated). That would mean that val5 is also part of the group formed by val2 and val4, since it’s roughly twice the value.
- It does not have to be really fast; O(n^2) would be okay (my frames are only about 12 rows and 300 columns).
- If that should not be possible, a list of exactly equal columns would somehow work, too. Then I would apply the round() function beforehand.
It’s not quite well-defined how to choose which columns are equal; for instance, you could have three columns where A and B are “equal” and B and C are “equal” but A and C are not. What to do then? One way around that might be to use hierarchical clustering, maybe like this:
Using the data from Andrie’s answer, first transpose it and make it into a matrix; I’ll also standardize each row (what was a column) as a start at finding linear combinations; this will group rows that are exact multiples of each other but not more complex combinations.
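A minimal sketch of this step. Andrie’s data is not reproduced here, so this assumes the sample data.frame from the question instead. The names dat, m and m_std are only illustrative, and “standardize” is taken to mean scaling each row to unit Euclidean length, which is an assumption (dividing by the row sum would be another reasonable choice).

```r
# Sample data from the question (assumption: stands in for Andrie's data)
dat <- data.frame(
  val1 = c(1.1, 5.7, 3.1, 9.6),
  val2 = c(2.0, 5.0, 3.0, 1.0),
  val3 = c(1.1, 5.6, 3.2, 9.5),
  val4 = c(2.1, 4.9, 2.9, 1.0),
  val5 = c(4.2, 9.9, 5.9, 2.0)
)

# Transpose: each original column becomes a row of a numeric matrix
m <- t(as.matrix(dat))

# Scale each row to unit Euclidean length, so that rows which are
# positive multiples of each other end up (nearly) identical
m_std <- m / sqrt(rowSums(m^2))

round(m_std, 3)
```

With this scaling, the rows for val4 and val5 should come out almost identical, because val5 is roughly twice val4.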
We now make a tree, and for interest, plot it. This uses the default distance function (Euclidean) but others are possible.
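Building the tree could look roughly like this. Again this is a sketch on the question’s sample data with illustrative variable names, not necessarily the exact code used for Andrie’s data; dist() defaults to Euclidean distance and hclust() to complete linkage.

```r
# Sample data from the question (assumption: stands in for Andrie's data)
dat <- data.frame(
  val1 = c(1.1, 5.7, 3.1, 9.6),
  val2 = c(2.0, 5.0, 3.0, 1.0),
  val3 = c(1.1, 5.6, 3.2, 9.5),
  val4 = c(2.1, 4.9, 2.9, 1.0),
  val5 = c(4.2, 9.9, 5.9, 2.0)
)

# Transpose and scale each row to unit length (assumed standardization)
m <- t(as.matrix(dat))
m_std <- m / sqrt(rowSums(m^2))

# Pairwise Euclidean distances between the standardized rows,
# then agglomerative clustering (complete linkage by default)
d  <- dist(m_std)
hc <- hclust(d)

# Dendrogram: near-equal columns join at a low height
plot(hc)

# Another distance could be swapped in, e.g.:
# hc_manhattan <- hclust(dist(m_std, method = "manhattan"))
```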
We then choose where to cut the tree into groups (this is where you choose how close two columns have to be to count as “equal”); I output the grouping together with the sum of values of each column to see if any are multiples of another.
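A sketch of the cutting step, once more on the question’s sample data. The cut height h = 0.1 is an arbitrary value picked for this example and would need tuning for real data; the variable names are illustrative.

```r
# Sample data from the question (assumption: stands in for Andrie's data)
dat <- data.frame(
  val1 = c(1.1, 5.7, 3.1, 9.6),
  val2 = c(2.0, 5.0, 3.0, 1.0),
  val3 = c(1.1, 5.6, 3.2, 9.5),
  val4 = c(2.1, 4.9, 2.9, 1.0),
  val5 = c(4.2, 9.9, 5.9, 2.0)
)

# Transpose, scale rows to unit length (assumed standardization), cluster
m <- t(as.matrix(dat))
m_std <- m / sqrt(rowSums(m^2))
hc <- hclust(dist(m_std))

# Cut the tree at a chosen height; h = 0.1 is an assumed threshold
groups <- cutree(hc, h = 0.1)

# Group membership next to each column's sum, to spot multiples
result <- data.frame(group = groups, sum = colSums(dat))
result

# One representative column name per group
unname(tapply(names(groups), groups, function(x) x[1]))
```

Because the rows were scaled first, val5 lands in the same group as val2 and val4 here, which is the “linearly dependent” grouping from the question. The sum column then shows that val5 is about twice the others in its group. To treat val5 as different, as in the first desired result, you would cluster the unscaled matrix instead.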