I am trying to remove NAs from my data frame by interpolation with na.approx() but can’t remove all of the NAs.
My data frame is 4096×4096, with 270.15 as the flag for invalid values. I need the data to be continuous at all points to feed a meteorological model. Yesterday I asked, and got an answer, on how to replace values in a data frame based on another data frame. After that I came across na.approx(), so I decided to replace the 270.15 values with NA and use na.approx() to interpolate the data. My question is why na.approx() does not replace all of the NAs.
This is what I am doing:
- Read the original hdf file with hdf5load
- Subset the data frame (4094×4096)
- Substitute flag value with NA
  > sst4[sst4 == 270.15 ] = NA
- Check first column (or any other)
  > summary(sst4[,1])
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    271.3   276.4   285.9   285.5   292.3   302.8  1345.0
- Run na.approx
  > sst4=na.approx(sst4,na.rm="FALSE")
- Check first column
  > summary(sst4[,1])
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    271.3   276.5   286.3   285.9   292.6   302.8   411.0
As you can see 411 NA’s have not been removed. Why? Do they all correspond to leading/ending column values?
head(sst4[,1])
[1] NA NA NA NA NA NA
tail(sst4[,1])
[1] NA NA NA NA NA NA
Does na.approx need valid values before and after an NA in order to interpolate? Do I need to set any other na.approx option?
Thank you very much
A small, reproducible example:
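As a minimal sketch, assuming the zoo package is installed and using a made-up 5×3 matrix `m` in place of the real data (the values are illustrative only, not taken from the question), the code below replaces the 270.15 flag with NA and runs na.approx down the columns. Interior gaps get interpolated, while NAs at the top or bottom of a column are left alone.

```r
library(zoo)  # provides na.approx()

# Toy 5 x 3 matrix (made-up values); 270.15 marks an invalid cell.
# matrix() fills column by column, so each row of numbers below is one column.
m <- matrix(c(270.15, 280,    270.15, 284,    270.15,
              281,    270.15, 285,    270.15, 289,
              270.15, 270.15, 283,    285,    287),
            nrow = 5)

# Replace the flag value with NA.
m[m == 270.15] <- NA

# Interpolate down each column; na.rm = FALSE keeps the original dimensions.
filled <- na.approx(m, na.rm = FALSE)
print(filled)

# Count the NAs that survive: only leading/trailing ones should be left.
remaining <- sum(is.na(filled))
print(remaining)
```

In this toy case the gap between 280 and 284 in the first column is filled, but the NA above 280 and the NA below 284 are not, because there is no known value on the other side to interpolate from.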
Yup, looks like you do need the start/end values of columns to be known or the interpolation doesn’t work. Can you guess values for your boundaries?
ANOTHER EDIT: So by default, you need the start and end values of columns to be known. However, it is possible to get na.approx to always fill in the blanks by passing rule = 2. See Felix’s answer. You can also use na.fill to provide a default value, as per Gabor’s comment. Finally, you can interpolate boundary conditions in two directions (see below) or guess boundary conditions.
EDIT: A further thought. Since na.approx is only interpolating in columns, and your data is spatial, perhaps interpolating in rows would be useful too. Then you could take the average.
na.approx fails when whole columns are NA, so we create a bigger dataset.
Run na.approx both ways.
Find out the best guess.
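A rough sketch of two of these ideas, assuming the zoo package and a made-up 4×4 matrix `m` (every row and column keeps at least two known values, so no whole row or column is NA). The names `by_rule`, `by_col`, `by_row` and `best` are just illustrative. It first passes rule = 2, which na.approx hands on to approx() so the nearest known value is repeated past the ends, and then interpolates along columns and along rows and averages whatever each direction managed to fill.

```r
library(zoo)  # provides na.approx()

# Toy 4 x 4 field (made-up values) with three missing cells.
m <- matrix(c(281, 283, 285, 287,
               NA, 284, 286,  NA,
              283,  NA, 287, 289,
              284, 286, 288, 290),
            nrow = 4)

# Option 1: rule = 2 extends the nearest known value beyond the ends.
by_rule <- na.approx(m, rule = 2, na.rm = FALSE)

# Option 2: interpolate in both directions.
by_col <- na.approx(m, na.rm = FALSE)        # down each column
by_row <- t(na.approx(t(m), na.rm = FALSE))  # along each row

# Stack the two results and average cell by cell, ignoring NAs,
# so a cell filled in only one direction still gets a value.
both <- array(c(by_col, by_row), dim = c(dim(m), 2))
best <- apply(both, c(1, 2), mean, na.rm = TRUE)

print(by_col)
print(by_row)
print(best)
```

A cell that neither direction can fill would come out as NaN from the average, so it is worth checking the result with anyNA() before feeding it to the model.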