I have a data.table DT and I want to run model.matrix on it. Each row has a string ID, which is stored in the ID column of DT. When I run model.matrix on DT, my formula excludes the ID column. The problem is, model.matrix drops some rows because of NAs. If I set the rownames of DT to the ID column, before calling model.matrix, then the final model matrix has rownames, and I’m all set. Otherwise, I can’t figure out what rows I end up with. I’m setting the rownames with rownames(DT) = DT$ID. However, when I try to add a new column to DT, I get a complaint about
“Invalid .internal.selfref detected . . . At an earlier point, this
data.table has been copied by R.”
So I’m wondering:
- Is there a better way to set rownames for a data.table?
- Is there a better approach to solving this problem?
There are a couple of issues here.
Firstly, it is a feature of data.tables that they do not have rownames; instead they have keys, which are far more powerful. See this great vignette.
But it isn’t the end of the world. model.matrix returns sensible rownames when you pass it a data.table. For example:
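A minimal sketch of this behaviour, assuming a small made-up table A (the column names x and y and all values are hypothetical) in which rows 1 and 4 each contain an NA:

```r
library(data.table)

# Hypothetical data: rows 1 and 4 have an NA in one of the model variables
A <- data.table(
  ID = c("a", "b", "c", "d", "e"),
  x  = c(NA, 1.5, 2.0, 3.1, 4.2),
  y  = c(0.3, 0.8, 1.1, NA, 2.5)
)

# The formula leaves out ID; incomplete rows are dropped by the
# default na.action (na.omit)
mm <- model.matrix(~ x + y, data = A)

# The rownames are the original row numbers of the rows that survived
rownames(mm)
```

The rownames come back as character strings, so they need as.integer() before being used as row indices.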
So rows 2, 3 and 5 are those included in the model.matrix.
Now, you can add the original row numbers as a column to A. This will be useful if you then set the key to something else (thereby losing the original order). You might consider making it character (like the rownames of mm), but it won’t really matter, as you can just as easily convert rownames(mm) to numeric when you need to reference rows.
As to the warning that data.table gives: if you read the next sentence of the message, it tells you to avoid names<- and attr<-, which may copy the whole data.table. rownames are an attribute, so rownames<- (which internally at some point uses the equivalent of attr<-) will possibly copy in the same way. The relevant line in `row.names<-.data.frame` is an attr<- assignment to the "row.names" attribute.
That being said, data.tables don’t have rownames, so there is no point setting them.
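The row-number column suggested above can be sketched as follows. This is only an illustration under assumptions: the table A, its columns x and y, and the column name rn are all made up, and the IDs are chosen so that re-keying actually changes the row order.

```r
library(data.table)

# Hypothetical data: rows 1 and 4 have an NA in one of the model variables
A <- data.table(
  ID = c("e", "d", "c", "b", "a"),
  x  = c(NA, 1.5, 2.0, 3.1, 4.2),
  y  = c(0.3, 0.8, 1.1, NA, 2.5)
)

# Store the original row number by reference (no copy, no rownames needed)
A[, rn := .I]

mm <- model.matrix(~ x + y, data = A)

# Row numbers of the rows that model.matrix kept
kept <- as.integer(rownames(mm))

# Re-keying sorts A by ID, so the physical row order no longer matches mm
setkey(A, ID)

# Recover the IDs in the same order as the rows of mm
ids <- A$ID[match(kept, A$rn)]
ids
```

Because rn travels with each row, the lookup still works after setkey() has reordered the table, which is the situation where plain row positions would point at the wrong rows.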