As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well. For instance:
Many R functions have an na.rm flag that, when set to TRUE, removes the NAs before the computation:
> v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=TRUE)
> v
[1] 32.71429
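The same flag is accepted by several other base summary functions. A minimal sketch, reusing the values from the call above (the name `x` is only for illustration):

```r
# Values from the example above; `x` is an illustrative name.
x <- c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67)

mean(x)                 # NA: one missing value is enough to make the result NA
mean(x, na.rm = TRUE)   # mean of the 7 observed values
sum(x, na.rm = TRUE)    # 229
max(x, na.rm = TRUE)    # 87

# na.rm only affects the single call; x itself still holds its NAs.
sum(is.na(x))           # 3
```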
But if you want to deal with NAs before the function call, you need to do something like this:
to remove each ‘NA’ from a vector:
vx = vx[!is.na(vx)]
to replace each ‘NA’ in a vector with a ‘0’:
vx = ifelse(is.na(vx), 0, vx)
to remove every row that contains an ‘NA’ from a data frame:
dfx = dfx[complete.cases(dfx),]
All of these permanently remove (or overwrite) each ‘NA’, or each row with an ‘NA’ in it, because the result is assigned back to the original object.
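A minimal sketch of the three idioms on made-up data. Assigning the results to new names (`vx_dropped`, `vx_zeroed`, and `dfx_full` are illustrative) rather than back to `vx` and `dfx` leaves the originals intact:

```r
# Hypothetical toy data; vx and dfx mirror the names used above.
vx  <- c(1, NA, 3, NA, 5)
dfx <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

vx_dropped <- vx[!is.na(vx)]              # 1 3 5
vx_zeroed  <- ifelse(is.na(vx), 0, vx)    # 1 0 3 0 5
dfx_full   <- dfx[complete.cases(dfx), ]  # keeps only row 1

# Because the results went to new names, the originals are untouched.
anyNA(vx)    # TRUE
nrow(dfx)    # 3
```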
Sometimes this isn’t quite what you want, though. Making an ‘NA’-excised copy of the data frame might be necessary for the next step in the workflow, but in subsequent steps you often want those rows back (e.g., to calculate a column-wise statistic for a column that has no ‘NA’ values of its own but lost rows to a prior call to complete.cases).
To be as clear as possible about what I’m looking for: Python/NumPy has a class, masked array, with a mask attribute, which lets you conceal, but not remove, NAs during a function call. Is there an analogous function in R?
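To make the problem concrete, here is a small sketch on a hypothetical data frame in which column `a` is complete and column `b` has gaps (all names and values are invented for illustration):

```r
# Hypothetical data frame: column `a` is complete, column `b` has gaps.
dfx <- data.frame(a = c(10, 20, 30, 40), b = c(1, NA, 3, NA))

dfx_full <- dfx[complete.cases(dfx), ]   # rows 2 and 4 are dropped

mean(dfx_full$a)   # 20: computed from only 2 of the 4 values in `a`
mean(dfx$a)        # 25: the statistic over every row of `a`
```

The first result is skewed only because `b` was missing in rows 2 and 4; `a` itself had nothing to exclude.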
Exactly what to do with missing data, which may be flagged as NA if we know it is missing, may well differ from domain to domain.
To take an example related to time series, where you may want to skip, or fill, or interpolate, or interpolate differently: the (very useful and popular) zoo package has functions for NA handling that let you approximate (using different algorithms), carry forward or backward, use spline interpolation, or trim.
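As a rough sketch of what those options look like in practice, assuming the zoo package is installed (the block is skipped otherwise). The input vectors and result names are made up for illustration, and only the default arguments are used apart from `fromLast`:

```r
# Sketch only: assumes the zoo package is installed; skipped otherwise.
if (requireNamespace("zoo", quietly = TRUE)) {
  x <- c(1, NA, 3, NA, 5)          # gaps in the interior
  y <- c(NA, 1, NA, 3, NA)         # gaps at both ends as well

  interp     <- zoo::na.approx(x)                 # linear interpolation
  splined    <- zoo::na.spline(x)                 # spline interpolation
  filled_fwd <- zoo::na.locf(x)                   # carry last observation forward
  filled_bwd <- zoo::na.locf(x, fromLast = TRUE)  # carry next observation backward
  trimmed    <- zoo::na.trim(y)                   # drop leading and trailing NAs only
}
```

Each call returns a new object, so `x` and `y` keep their NAs and a different strategy can be tried later.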
Another example would be the numerous missing-data imputation packages on CRAN, which often provide domain-specific solutions. [ So if you call R a DSL, what is this? “Sub-domain specific solutions for domain specific languages” or SDSSFDSL? Quite a mouthful 🙂 ]
But for your specific question: no, I am not aware of a bit-level flag in base R that allows you to mark observations as ‘to be excluded’. I presume most R users would resort to functions like na.omit() et al. or use the na.rm=TRUE option you mentioned.
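One way to get close to mask-like behaviour with those tools is to leave the original object alone and keep the row filter in an ordinary logical vector. This is only a sketch of a workaround, not a built-in mask type; the data and the name `keep` are hypothetical:

```r
# Hypothetical data; `keep` is an ordinary logical vector, not a built-in mask type.
dfx <- data.frame(a = c(10, 20, 30, 40), b = c(1, NA, 3, NA))

# na.omit() returns a filtered copy and records which rows it dropped.
dfx_full <- na.omit(dfx)
attr(dfx_full, "na.action")        # rows 2 and 4 were omitted

# Emulating a mask: compute the row filter once, apply it per call.
keep <- complete.cases(dfx)
mean(dfx$b[keep])                  # 2, using only the complete rows
mean(dfx$a)                        # 25, using every row; nothing was deleted
```

Since `dfx` is never overwritten, later steps can still see all four rows.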