I have a large data set in R (1.2M records). Those are some readings


Asked: May 25, 2026 (2026-05-25T01:30:19+00:00)


I have a large data set in R (1.2M records). Those are some readings for different protocols. Now, I would like to classify this data (which I can do with rpart/RWeka). However, I first need to process the data, and this question is exactly about that.

The data set consists of a pair of outputs (throughput, response time) per set of control parameters, for 4 different protocols. I would like to “bin” these values: for each set of control parameters, choose only those protocols whose throughput is within 10% of the maximum throughput (for that set of input parameters), and those whose response time is within 10% of the minimum response time.

I know I can use aggregate to find the max throughput and min response time in another data.frame, and then join it with the original data.frame. Then I can use ifelse to find the protocol names matching the criteria. However, that seems inefficient to me, and I don’t know how I would encode multiple matches (per set of input values) in a single column.

Any suggestions?

Example (REQS and REPS are input parameters):

PROTO  REQS  REPS  THR  RT
A      8     8     10   1
B      8     8     9.5  2
C      8     8     7    1.1
A      16    8     10   4
B      16    8     5    1
C      16    8     1    0.5
A      8     16    8    1
B      8     16    10   1.09
C      8     16    9.5  1

Should produce something like:

REQS REPS THRGOOD RTGOOD BOTHGOOD
8    8    A,B     A,C    A
16   8    A       C      empty
8    16   B,C     A,B,C  B,C


1 Answer

Editorial Team · Answer 1 · 2026-05-25T01:30:19+00:00

ddply from the plyr package should be your friend here.

First, write a function that will give you the desired result if you were to get a data.frame with only the rows for 1 set of input parameters:

forOneSet <- function(dfr)
{
  # "Within 10%" is read here as: at least 90% of the best throughput and
  # at most 110% of the best response time; adapt the factors if needed.
  THRlim <- 0.9 * max(dfr$THR)
  RTlim <- 1.1 * min(dfr$RT)
  thrgood <- as.character(dfr$PROTO[dfr$THR >= THRlim])
  rtgood <- as.character(dfr$PROTO[dfr$RT <= RTlim])
  # BOTHGOOD holds the protocols that meet both criteria, hence intersect()
  bothgood <- intersect(thrgood, rtgood)
  # return a one-row data.frame with the wanted results for this 'partial' data.frame
  data.frame(REQS = dfr$REQS[1], REPS = dfr$REPS[1],
             THRGOOD = paste(thrgood, collapse = ","),
             RTGOOD = paste(rtgood, collapse = ","),
             BOTHGOOD = paste(bothgood, collapse = ","))
}

Now you can immediately use ddply (I’m assuming your original data.frame is called orgdfr):

library(plyr)
result <- ddply(orgdfr, .(REQS, REPS), forOneSet)
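
To check the idea against the example in the question without installing plyr, here is a minimal base-R sketch of the same split-apply-combine pattern. It rests on a few assumptions: "within 10%" is taken to mean THR >= 0.9 * max and RT <= 1.1 * min, BOTHGOOD is taken to be the protocols that appear in both lists, and the helper name summariseSet is made up for this illustration.

```r
# Example data copied from the question
orgdfr <- data.frame(
  PROTO = rep(c("A", "B", "C"), times = 3),
  REQS  = rep(c(8, 16, 8), each = 3),
  REPS  = rep(c(8, 8, 16), each = 3),
  THR   = c(10, 9.5, 7, 10, 5, 1, 8, 10, 9.5),
  RT    = c(1, 2, 1.1, 4, 1, 0.5, 1, 1.09, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise the rows of one (REQS, REPS) combination.
# Assumption: "within 10%" means >= 90% of max THR and <= 110% of min RT.
summariseSet <- function(dfr) {
  thrgood  <- dfr$PROTO[dfr$THR >= 0.9 * max(dfr$THR)]
  rtgood   <- dfr$PROTO[dfr$RT  <= 1.1 * min(dfr$RT)]
  bothgood <- intersect(thrgood, rtgood)  # protocols meeting both criteria
  data.frame(
    REQS = dfr$REQS[1], REPS = dfr$REPS[1],
    THRGOOD  = paste(thrgood,  collapse = ","),
    RTGOOD   = paste(rtgood,   collapse = ","),
    BOTHGOOD = paste(bothgood, collapse = ","),
    stringsAsFactors = FALSE
  )
}

# Split by parameter set (dropping unused combinations), summarise each
# piece, then stack the one-row results into a single data.frame.
groups <- split(orgdfr, list(orgdfr$REQS, orgdfr$REPS), drop = TRUE)
result <- do.call(rbind, lapply(groups, summariseSet))
rownames(result) <- NULL
print(result)
```

On the example data this reproduces the table from the question, with an empty string in BOTHGOOD where the question says "empty". Multiple matches are encoded in one column simply by pasting the protocol names together with a comma. For 1.2M records this per-group approach may or may not be fast enough, so it is worth timing on a subset first.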
