I am having trouble figuring out the most elegant and flexible way to switch data from long format to wide format when I have more than one measure variable I want to bring along.
For example, here’s a simple data frame in long format. ID is the subject, TIME is a time variable, and X and Y are measurements made of ID at TIME:
> my.df <- data.frame(ID=rep(c("A","B","C"), 5), TIME=rep(1:5, each=3), X=1:15, Y=16:30)
> my.df
ID TIME X Y
1 A 1 1 16
2 B 1 2 17
3 C 1 3 18
4 A 2 4 19
5 B 2 5 20
6 C 2 6 21
7 A 3 7 22
8 B 3 8 23
9 C 3 9 24
10 A 4 10 25
11 B 4 11 26
12 C 4 12 27
13 A 5 13 28
14 B 5 14 29
15 C 5 15 30
If I just wanted to turn the values of TIME into column headers containing the include X, I know I can use cast() from the reshape package (or dcast() from reshape2):
> cast(my.df, ID ~ TIME, value="X")
ID 1 2 3 4 5
1 A 1 4 7 10 13
2 B 2 5 8 11 14
3 C 3 6 9 12 15
But what I really want to do is also bring along Y as another measure variable, and have the column names reflect both the measure variable name and the time value:
ID X_1 X_2 X_3 X_4 X_5 Y_1 Y_2 Y_3 Y_4 Y_5
1 A 1 4 7 10 13 16 19 22 25 28
2 B 2 5 8 11 14 17 20 23 26 29
3 C 3 6 9 12 15 18 21 24 27 30
(FWIW, I don’t really care if all the X‘s are first followed by the Y‘s, or if they are interleaved as X_1, Y_1, X_2, Y_2, etc.)
I can get close to this by cast-ing the long data twice and merging the results, though the column names need some work, and I would need to tweak it if I needed to add a 3rd or 4th variable in addition to X and Y:
merge(
cast(my.df, ID ~ TIME, value="X"),
cast(my.df, ID ~ TIME, value="Y"),
by="ID", suffixes=c("_X","_Y")
)
Seems like some combination of functions in reshape2 and/or plyr should be able to do this more elegantly that my attempt, as well as handling multiple measure variables more cleanly. Something like cast(my.df, ID ~ TIME, value=c("X","Y")), which isn’t valid. But I haven’t been able to figure it out.
In order to handle multiple variables like you want, you need to
meltthe data you have before casting it.which gives
EDIT based on comment:
The data frame
gives a different result (all entries are 2) because the
ID/TIMEcombination does not indicate a unique row. In fact, there are two rows with eachID/TIMEcombinations.reshape2assumes a single value for each possible combination of the variables and will apply a summary function to create a single variable is there are multiple entries. That is why there is the warningYou can get something that works if you add another variable which breaks that redundancy.
This works because
cycle/ID/timenow uniquely defines a row inmy.df.