I currently have a function that looks for the best correlation for each file in my correlation matrix. I’m working on a list of files obtained with list.files. Here’s the function:
get.max.cor <- function(station, mat){
  mat[row(mat) == col(mat)] <- -Inf
  which( mat[station, ] == max(mat[station, ],na.rm=TRUE) )
}
If I have a correlation matrix like this (no NA values):
cor1 <- read.table(text="
ST208 ST209 ST210 ST211 ST212
ST208 1.0000000 0.8646358 0.8104837 0.8899451 0.7486417
ST209 0.8646358 1.0000000 0.9335584 0.8392696 0.8676857
ST210 0.8104837 0.9335584 1.0000000 0.8304132 0.9141465
ST211 0.8899451 0.8392696 0.8304132 1.0000000 0.8064669
ST212 0.7486417 0.8676857 0.9141465 0.8064669 1.0000000
", header=TRUE)
It works perfectly. If I have a correlation matrix with some NAs (but not only NAs) like this:
cor2 <- read.table(text="
ST208 ST209 ST210 ST211 ST212
ST208 1.0000000 NA 0.9666491 0.9573701 0.9233598
ST209 NA 1.0000000 0.9744054 0.9577192 0.9346706
ST210 0.9666491 0.9744054 1.0000000 0.9460145 0.9582683
ST211 0.9573701 0.9577192 0.9460145 1.0000000 NA
ST212 0.9233598 0.9346706 0.9582683 NA 1.0000000
", header=TRUE)
It still works thanks to na.rm=TRUE. But when one file has no data, so that its row and column contain only NAs, like this:
cor3 <- read.table(text="
ST208 ST209 ST210 ST211 ST212
ST208 1.0000000 NA 0.8104837 0.8899451 0.7486417
ST209 NA NA NA NA NA
ST210 0.8104837 NA 1.0000000 0.8304132 0.9141465
ST211 0.8899451 NA 0.8304132 1.0000000 0.8064669
ST212 0.7486417 NA 0.9141465 0.8064669 1.0000000
", header=TRUE)
It doesn’t work, of course, because there is no non-NA value and therefore no maximum correlation for this file.
That’s why I get this error: 0 (non-na) cases.
I tried to remove the NA columns, but as I’m working on a list of files, the number of files in the list and in the matrix would no longer be the same. I searched the web but only found topics about removing NA columns.
In my case, I would like to ignore these NA columns without removing them.
I would like to tell R: when you are looking for the highest correlation for each file in the correlation matrix, if you see a file with no correlation coefficients (a column of only NAs), don’t do anything with it, keep it as it is, and go to the next file (next column or row).
I also tried adding else {NA} or else {NULL} to avoid this problem, but it still doesn’t work.
Does somebody have an idea how to solve this problem?
Thank you very much.
Best regards
Geoffrey
Thanks for all your answers.
I think it’s now working for NA columns with Joran’s code. And you’re also right, Joran: I now get an error in the next function, which is used to process the output of get.max.cor.
na.fill <- function(x, y){
  i <- is.na(x[1:8700,1])
  xx <- y[1:8700,1]
  new <- data.frame(xx=xx)
  x[1:8700,1][i] <- predict(lm(x[1:8700,1]~xx, na.action=na.exclude), new)[i]
  x
}
process.all <- function(df.list, mat){
  f <- function(station)
    na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
  g <- function(station){
    x <- df.list[[station]]
    if(any(is.na(x[1:8700,1]))){
      mat[row(mat) == col(mat)] <- -Inf
      nas <- which(is.na(x[1:8700,1]))
      ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
      for(y in ord){
        if(all(!is.na(df.list[[y]][1:8700,1][nas]))){
          xx <- df.list[[y]][1:8700,1]
          new <- data.frame(xx=xx)
          x[1:8700,1][nas] <- predict(lm(x[1:8700,1]~xx, na.action=na.exclude), new)[nas]
          break
        }
      }
    }
    x
  }
  n <- length(df.list)
  nms <- names(df.list)
  max.cor <- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
  df.list <- lapply(seq.int(n), f)
  df.list <- lapply(seq.int(n), g)
  names(df.list) <- nms
  df.list
}
refill <- process.all(lst, corhiver2008capt1)
refill <- as.data.frame(refill)
capt1_hiver <- refill[1:8700,]
This function is used to fill the gaps in my files using the best-correlated file (the one I selected in the previous step).
My error is:
Error in model.frame.default(formula = x[1:8700, 1] ~ xx, na.action = na.exclude, :
invalid type (NULL) for variable 'xx'
This is probably due to the “NULL columns” of my correlation matrix: for files with no correlations, there is no data in the xx file.
How can I continue my calculation and ignore NULL files (or do nothing with them) without modifying the dimensions of my data (that is, keeping the NA files)? For the moment, it only works when I select only files with data (and not only NAs).
So I’m not entirely sure what you’re getting at here with your example code, but here we go.
When I run your function on the row of cor3 with all missing values, I get neither an error nor a warning, but the result probably isn’t what you wanted. Your function, as written, replaces the diagonal elements with -Inf, since each station is presumably perfectly correlated with itself, and you don’t want that to be the max.
This means that the next line is not operating on a row with only NA values; there’s that one -Inf in there. So na.rm = TRUE tosses all but the -Inf value, and since that’s all that’s left, it’s the max.
Clearly, that’s not what you wanted, though. But you also haven’t really specified how this function is being used to iterate through the rows, so I’ll have to speculate. If you modify your function to check whether the max value is finite, you can have it return NA when that row was all NA values. But then you’ll presumably also need to handle that NA with whatever code you’re using to process the output of get.max.cor (which I can’t really help with because you haven’t provided it).
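For illustration, here is a minimal sketch of what such a finiteness check could look like, run on cor3 from the question. It is only one possible way to write it and makes a few assumptions that are not in the question: the data frame is converted with as.matrix, diag<- replaces the row/col comparison, ties are broken by keeping the first match so that sapply always gets a single value per station, and the skip vector is a hypothetical name for flagging stations that downstream code should leave untouched.

```r
# cor3 from the question: station ST209 has no data at all.
cor3 <- read.table(text = "
ST208 ST209 ST210 ST211 ST212
ST208 1.0000000 NA 0.8104837 0.8899451 0.7486417
ST209 NA NA NA NA NA
ST210 0.8104837 NA 1.0000000 0.8304132 0.9141465
ST211 0.8899451 NA 0.8304132 1.0000000 0.8064669
ST212 0.7486417 NA 0.9141465 0.8064669 1.0000000
", header = TRUE)

# Sketch (assumption): return NA when a station has no finite
# correlation with any other station.
get.max.cor <- function(station, mat) {
  mat <- as.matrix(mat)        # work on a plain numeric matrix
  diag(mat) <- -Inf            # exclude the self-correlation
  best <- max(mat[station, ], na.rm = TRUE)
  # If only the -Inf on the diagonal survived na.rm, there is no partner.
  if (!is.finite(best)) return(NA_integer_)
  # Assumption: on ties, keep the first matching station.
  which(mat[station, ] == best)[1]
}

max.cor <- sapply(seq_len(nrow(cor3)), get.max.cor, mat = cor3)

# Hypothetical helper vector: TRUE for stations with no usable partner.
skip <- is.na(max.cor)
```

Whatever processes max.cor afterwards could then test skip[station] (or is.na(max.cor[station])) first and return that station’s data unchanged instead of indexing df.list with NA.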