I have a large data set in R (1.2M records) containing readings for different protocols. Eventually I would like to classify this data (which I can do with rpart/RWeka), but I first need to preprocess it, and that preprocessing is what this question is about.
The data set consists of a pair of outputs (throughput, response time) per set of control parameters, for 4 different protocols. I would like to “bin” these values: for each set of control parameters, choose only those protocols that are within 10% of the maximum throughput (for that set of input parameters), and those within 10% of the minimum response time.
I know I can use aggregate to compute the max throughput and min response time into another data.frame, and then join it with the original data.frame. Then I can use ifelse to find the protocol names matching the criteria. However, that seems inefficient to me, and I don’t know how I would encode multiple matches (per set of input values) in a single column.
Any suggestions?
Example (REQS and REPS are input parameters):
PROTO REQS REPS THR RT
A 8 8 10 1
B 8 8 9.5 2
C 8 8 7 1.1
A 16 8 10 4
B 16 8 5 1
C 16 8 1 0.5
A 8 16 8 1
B 8 16 10 1.09
C 8 16 9.5 1
Should produce something like:
REQS REPS THRGOOD RTGOOD BOTHGOOD
8 8 A,B A,C A
16 8 A C empty
8 16 B,C A,B,C B,C
ddply from the plyr package should be your friend here. First, write a function that will give you the desired result if you were to get a data.frame with only the rows for 1 set of input parameters:
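A minimal sketch of such a per-group function, using base R only. It assumes "within 10%" means THR >= 0.9 * max(THR) and RT <= 1.1 * min(RT), and that matching protocols are joined with commas as in the question's expected output. The name good_protocols and the tol argument are illustrative choices, not names from the question.

```r
# Hypothetical helper: summarise ONE group (one REQS/REPS combination).
# Returns a one-row data.frame with comma-separated protocol names.
good_protocols <- function(grp, tol = 0.10) {
  # Protocols whose throughput is within tol of the group's maximum
  thr_ok <- grp$THR >= (1 - tol) * max(grp$THR)
  # Protocols whose response time is within tol of the group's minimum
  rt_ok <- grp$RT <= (1 + tol) * min(grp$RT)
  # Join the selected names; yields "" when nothing matches
  collapse <- function(keep) paste(as.character(grp$PROTO)[keep], collapse = ",")
  data.frame(
    THRGOOD  = collapse(thr_ok),
    RTGOOD   = collapse(rt_ok),
    BOTHGOOD = collapse(thr_ok & rt_ok),
    stringsAsFactors = FALSE
  )
}

# Try it on the rows for REQS = 8, REPS = 8 from the example
one_group <- data.frame(
  PROTO = c("A", "B", "C"),
  REQS  = 8,
  REPS  = 8,
  THR   = c(10, 9.5, 7),
  RT    = c(1, 2, 1.1),
  stringsAsFactors = FALSE
)
res <- good_protocols(one_group)
print(res)
```

Encoding multiple matches as one comma-separated string keeps a single row per parameter set. An empty string plays the role of "empty" in the expected output.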
Now you can immediately use ddply (I’m assuming your original data.frame is called orgdfr):
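A sketch of what that call could look like, assuming the plyr package is installed. The per-group helper (hypothetically named good_protocols here) is defined inside the snippet so that it runs on its own, and orgdfr is rebuilt from the example data in the question.

```r
library(plyr)

# Example data from the question
orgdfr <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
PROTO REQS REPS THR RT
A 8 8 10 1
B 8 8 9.5 2
C 8 8 7 1.1
A 16 8 10 4
B 16 8 5 1
C 16 8 1 0.5
A 8 16 8 1
B 8 16 10 1.09
C 8 16 9.5 1
")

# Hypothetical per-group helper: one summary row per REQS/REPS combination.
# Assumes "within 10%" means >= 0.9 * max(THR) and <= 1.1 * min(RT).
good_protocols <- function(grp, tol = 0.10) {
  thr_ok <- grp$THR >= (1 - tol) * max(grp$THR)
  rt_ok  <- grp$RT  <= (1 + tol) * min(grp$RT)
  collapse <- function(keep) paste(as.character(grp$PROTO)[keep], collapse = ",")
  data.frame(
    THRGOOD  = collapse(thr_ok),
    RTGOOD   = collapse(rt_ok),
    BOTHGOOD = collapse(thr_ok & rt_ok),
    stringsAsFactors = FALSE
  )
}

# Split by the input parameters, apply the helper, and bind the rows back
result <- ddply(orgdfr, .(REQS, REPS), good_protocols)
print(result)
```

ddply prepends the grouping columns (REQS, REPS) to each helper result, so the output has one row per parameter set with the three summary columns. Extra arguments after the function are passed through to it, so a different tolerance could be supplied as ddply(orgdfr, .(REQS, REPS), good_protocols, tol = 0.05).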