The other day I asked a question about how to get a histogram of the date differences. I would like to do the same thing, but for groups and with a box plot, using lattice’s bwplot. Essentially, I want one image with five box plots, one for each of the five sources I have (the example below shows two).
I’ve spent quite some time trying to figure this out, but cannot get it.
The closest I could come up with is:
df <- read.csv("~/dates.csv", header = TRUE, sep = ",", quote = "\"")
a <- aggregate(as.POSIXct(as.character(df$REQUEST_DATE), format="%m/%d/%Y %H:%M:%S"), list(SOURCE=df$SOURCE), diff) # not sure if this is right (and I need -diff, but can't do that)
# now what? I seem to know how to access a$SOURCE, but don't know how to look at the data associated with a$SOURCE.
The data (~/dates.csv):
"SOURCE","REQUEST_DATE"
"A","09/11/2011 09:28:48"
"A","09/11/2011 09:21:15"
"A","09/11/2011 09:15:42"
"A","09/11/2011 09:12:18"
"D","09/13/2011 09:06:53"
"D","09/13/2011 09:06:18"
"D","09/13/2011 08:56:55"
"D","09/13/2011 08:56:18"
"D","09/13/2011 08:55:43"
"D","09/13/2011 08:39:07"
Here is a solution using the plyr package for the data analysis, and the ggplot2 package for the plot.
Read the data. Note the use of stringsAsFactors=FALSE, which saves lots of hassle converting to as.character later.
Convert to POSIX date format.
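A minimal sketch of the read and convert steps, under a couple of assumptions: the sample rows from the question are inlined through `read.csv(text = ...)` so the snippet runs on its own, and `tz = "UTC"` is chosen here only to keep the parsed times free of daylight-saving jumps. With the real file you would pass the path to `read.csv` instead.

```r
# Inline copy of the question's sample data so the sketch is self-contained.
# With the real file: df <- read.csv("~/dates.csv", stringsAsFactors = FALSE)
csv_text <- '"SOURCE","REQUEST_DATE"
"A","09/11/2011 09:28:48"
"A","09/11/2011 09:21:15"
"A","09/11/2011 09:15:42"
"A","09/11/2011 09:12:18"
"D","09/13/2011 09:06:53"
"D","09/13/2011 09:06:18"
"D","09/13/2011 08:56:55"
"D","09/13/2011 08:56:18"
"D","09/13/2011 08:55:43"
"D","09/13/2011 08:39:07"'

# stringsAsFactors = FALSE keeps both columns as plain character vectors
df <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Parse the timestamps; the time zone is an assumption, not part of the data
df$REQUEST_DATE <- as.POSIXct(df$REQUEST_DATE,
                              format = "%m/%d/%Y %H:%M:%S",
                              tz = "UTC")

str(df)
```

Because the column is already character, no `as.character()` wrapper is needed before the conversion.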
Load plyr and use ddply to a) group by SOURCE, b) calculate difftime, c) group results into a data.frame, all in one step.
Load ggplot2 and plot. The plot looks a bit rubbish because the sample dataset is tiny. It will work better with larger datasets, i.e. you will get clear separation between median, range and outliers.
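The grouping and plotting steps could be sketched roughly as follows. The column name `DIFF_MIN`, the choice of minutes as the unit, the `"UTC"` time zone, and sorting each group before differencing (so the gaps come out positive, which covers the "-diff" concern from the question) are all assumptions made for illustration. The data frame is rebuilt inline so the snippet runs by itself.

```r
library(plyr)
library(ggplot2)

# Sample data from the question, rebuilt inline (tz is an assumption)
stamps <- c("09/11/2011 09:28:48", "09/11/2011 09:21:15",
            "09/11/2011 09:15:42", "09/11/2011 09:12:18",
            "09/13/2011 09:06:53", "09/13/2011 09:06:18",
            "09/13/2011 08:56:55", "09/13/2011 08:56:18",
            "09/13/2011 08:55:43", "09/13/2011 08:39:07")
df <- data.frame(
  SOURCE = c(rep("A", 4), rep("D", 6)),
  REQUEST_DATE = as.POSIXct(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Group by SOURCE, compute gaps between consecutive requests, and
# collect everything into one data.frame (SOURCE is added back by ddply)
diffs <- ddply(df, .(SOURCE), function(d) {
  times <- sort(d$REQUEST_DATE)            # ascending, so gaps are positive
  data.frame(DIFF_MIN = as.numeric(diff(times), units = "mins"))
})

# One box per source
p <- ggplot(diffs, aes(x = SOURCE, y = DIFF_MIN)) +
  geom_boxplot() +
  labs(x = "Source", y = "Time between requests (minutes)")
print(p)
```

Since the question asked for lattice, the same `diffs` data frame should also work with something like `lattice::bwplot(DIFF_MIN ~ factor(SOURCE), data = diffs)`.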