I am an R neophyte, with a data frame of database function runtimes with

Question

Asked: May 23, 2026 (2026-05-23T21:48:47+00:00)

I am an R neophyte, with a data frame of database function runtimes with the following data:

> head(data2)
              dbfunc runtime
1 fn_slot03_byperson  38.083
2 fn_slot03_byperson  32.396
3 fn_slot03_byperson  41.246
4 fn_slot03_byperson  92.904
5 fn_slot03_byperson 130.512
6 fn_slot03_byperson 113.853

The data covers 127 distinct functions across some 1,940,170 rows.

I would like to:

Summarise the data to only include database functions with a mean runtime of over 100 ms
Produce boxplots of the 25 slowest database functions showing the distribution of runtimes, sorted by slowest first.

I’m particularly stumped by the summary step.

Note: I’ve also asked this question at stats.stackexchange.com.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T21:48:48+00:00

Here’s one approach using ggplot2 and plyr. The steps you outlined could be combined to be slightly more efficient, but for learning purposes I’ll show them in the order you asked.

#Load ggplot2 (plotting) and plyr (needed for ddply below), then make some fake data
library(ggplot2)
library(plyr)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100)
                  , runtime = runif(1000, max = 300))

#Use plyr to calculate a new variable for the mean runtime by dbfunc and add as 
#a new column
dat <- ddply(dat, "dbfunc", transform, meanRunTime = mean(runtime))

#Subset only those dbfunc with mean run times greater than 100. Is this step necessary?
dat.long <- subset(dat, meanRunTime > 100)


#Reorder the levels of the dbfunc variable by mean runtime, slowest first. Note that reorder
#accepts a function like mean, so if the subset step above isn't necessary, then we can simply
#pass runtime to reorder directly instead of the precomputed column.
dat.long$dbfunc <- reorder(dat.long$dbfunc, -dat.long$meanRunTime)

#Subset one more time to get the top *n* dbfunctions based on mean runtime. I chose three here...
dat.plot <- subset(dat.long, dbfunc %in% levels(dbfunc)[1:3])

#Now you have your top three dbfuncs, but a bunch of unused levels hanging out so let's drop them
dat.plot$dbfunc <- droplevels(dat.plot$dbfunc)

#Plotting time!
ggplot(dat.plot, aes(dbfunc, runtime)) + 
  geom_boxplot()
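With nearly two million rows, it may be worth knowing that the mean-per-function column can also be added without plyr. The following is a minimal sketch in base R, assuming the same column names as above; the data is made up and `set.seed` is only there so the example is repeatable. I haven't timed it against ddply on data of your size.

```r
# Illustrative data with the same column names as the question (values are made up)
set.seed(1)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100),
                  runtime = runif(1000, max = 300))

# ave() returns a vector as long as its input holding each row's group mean
# (FUN defaults to mean), so it can be assigned straight to a new column
dat$meanRunTime <- ave(dat$runtime, dat$dbfunc)

# Keep only the rows whose function averages more than 100 ms
dat.long <- dat[dat$meanRunTime > 100, ]
```

From here the reorder, subset and droplevels steps work the same way as before.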

Like I said, I feel a few of those steps could be combined and made more efficient, but wanted to show you the steps as you outlined them.
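For what it's worth, here is one way the summary, filter and top-n steps might be combined, sketched in base R on made-up data. The names `means`, `slow`, `top` and `n` are just ones I picked for illustration, and `n` is 3 here only because the fake data has ten functions; you would set it to 25.

```r
# Illustrative data with the same column names as the question (values are made up)
set.seed(1)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100),
                  runtime = runif(1000, max = 300))

n <- 3  # the question asks for 25; 3 suits this small example

# One mean per function, as a named vector
means <- tapply(dat$runtime, dat$dbfunc, mean)

# Keep means over 100 ms, sort slowest first, and take the names of the top n
slow <- sort(means[means > 100], decreasing = TRUE)
top <- names(head(slow, n))

# Keep only those functions; the factor's level order sets the plotting order
dat.plot <- dat[dat$dbfunc %in% top, ]
dat.plot$dbfunc <- factor(dat.plot$dbfunc, levels = top)
```

The resulting `dat.plot` can be handed to the same `ggplot(dat.plot, aes(dbfunc, runtime)) + geom_boxplot()` call. Working from the small `means` vector means the full data frame is only subset once, at the end.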
