I have a dataset with about 3 million rows and the following structure:
PatientID| Year | PrimaryConditionGroup
---------------------------------------
1 | Y1 | TRAUMA
1 | Y1 | PREGNANCY
2 | Y2 | SEIZURE
3 | Y1 | TRAUMA
Being fairly new to R, I have some trouble finding the right way to reshape the data into the structure outlined below:
PatientID| Year | TRAUMA | PREGNANCY | SEIZURE
----------------------------------------------
1 | Y1 | 1 | 1 | 0
2 | Y2 | 0 | 0 | 1
3 | Y1 | 1 | 0 | 0
My question is: What is the fastest/most elegant way to create a data.frame where the values of PrimaryConditionGroup become columns, grouped by PatientID and Year (counting the number of occurrences)?
There are probably more succinct ways of doing this, but for sheer speed, it’s hard to beat a data.table-based solution.
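A minimal sketch of what a data.table-based approach could look like, assuming the data.table package is installed. The sample data is copied from the question; the names `df`, `dt` and `counts` are illustrative only, and the three condition names are hard-coded for this toy example.

```r
library(data.table)

# Sample data copied from the question
df <- data.frame(
  PatientID = c(1, 1, 2, 3),
  Year = c("Y1", "Y1", "Y2", "Y1"),
  PrimaryConditionGroup = c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA")
)

dt <- as.data.table(df)

# One count column per condition, computed within each PatientID/Year group
counts <- dt[, .(TRAUMA    = sum(PrimaryConditionGroup == "TRAUMA"),
                 PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
                 SEIZURE   = sum(PrimaryConditionGroup == "SEIZURE")),
             by = .(PatientID, Year)]

print(counts)
```

Spelling out each column keeps the grouping step explicit, but it only scales to a handful of conditions; with many distinct values, a generic reshaping function is likely the more practical choice.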
Alternatively, aggregate() provides a ‘base R’ solution that might or might not be more idiomatic. (The sole complication is that aggregate() returns the counts as a single matrix-valued column inside its data.frame, rather than as separate columns, so a second step is needed to spread that matrix into ordinary columns.)
Finally, a succinct solution using the reshape package gets you to the same place.
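The aggregate() and reshape routes might be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the sample data comes from the question, the names `out`, `wide_base`, `mdf` and `wide_reshape` are made up for the example, and the reshape part only runs if that package happens to be installed. Note that the condition column is converted to a factor first, so that table() reports every level (including zero counts) for each group.

```r
# Sample data copied from the question; the factor fixes the column order
df <- data.frame(
  PatientID = c(1, 1, 2, 3),
  Year = c("Y1", "Y1", "Y2", "Y1"),
  PrimaryConditionGroup = factor(
    c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA"),
    levels = c("TRAUMA", "PREGNANCY", "SEIZURE")
  )
)

# Base R: tabulate the conditions within each PatientID/Year group.
# The third column of `out` is a matrix with one column per factor level.
out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data = df, FUN = table)

# Spread the matrix column into ordinary data.frame columns
wide_base <- cbind(out[1:2], as.data.frame(out[[3]]))
print(wide_base)

# reshape package (assumed to be installed; skipped otherwise):
# melt to long form, then cast back to wide form, counting occurrences
if (requireNamespace("reshape", quietly = TRUE)) {
  mdf <- reshape::melt(df, id = c("PatientID", "Year"))
  wide_reshape <- reshape::cast(mdf, PatientID + Year ~ value,
                                fun.aggregate = length)
  print(wide_reshape)
}
```

The row order of the aggregate() result follows the grouping variables rather than the original row order, so look rows up by PatientID and Year instead of by position.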