Question
I’d like to use ggplot’s geom_boxplot and use my own data columns for the quantile segments, instead of those returned by stat_boxplot.
The data, after doing some transformations, looks like this:
> allquartile
T method s.0% s.25% s.50% s.75% s.100%
1 2 LDA -196.76273 -190.38842 -184.01411 -177.63979 -171.26548
2 3 LDA -171.53987 -166.16923 -160.79859 -115.28652 -69.77446
3 4 LDA -161.17590 -157.61372 -149.71026 -124.68926 -69.77446
4 5 LDA -194.10553 -179.83165 -175.14337 -168.46104 -159.07206
After doing a lot of searching and digging, I figured out that my plotting command should look like this:
p <- ggplot(allquartile,aes(x=T, ymin=`s.0%`, lower=`s.25%`,
middle=`s.50%`, upper=`s.75%`,
ymax=`s.100%`, color=method)) +
geom_boxplot(stat="identity")
This should use s.0% as the min, s.25% as the lower, and so on. But when I try to display p, I get the following error:
Error in eval(expr, envir, enclos) : object 's.0%' not found
Calls: print ... lapply -> is.vector -> lapply -> FUN -> eval -> eval
I’ve also tried using aes_string in place of aes, and I instead get this error:
Error in aes_string(x = T, ymin = `s.0%`, lower = `s.25%`, middle = `s.50%`, :
object 's.0%' not found
I’m fairly new to both R and ggplot2, so I’m not really sure how to interpret this, but I’m assuming it’s because of the . in s.0%.
I’d greatly appreciate any suggestions on how to get around this.
Edit: I’ve dug around more and I think this is due to my misunderstanding of the quantile method. I created allquartile by this command:
allquartile <-aggregate(list(s=topicquality$score), list(T=topicquality$T,method=topicquality$method),FUN=quantile,probs=seq(0, 1, .25))
And I realize that there are no columns named score.0%, score.25%, etc. There is just the score column with 5 values. So this boils down to: how do I access those 5 values within score?
SOLUTION
I’ve found the issue with my dataset. As I mentioned in my edit, the columns score.0%, score.25%, etc. didn’t exist because of how I formed the data frame. For example, running colnames(allquartile) returned:
[1] "T" "method" "score"
It turns out that the score column is a vector of values. Running allquartile$score gives me:
0% 25% 50% 75% 100%
[1,] -196.7627 -190.3884 -184.0141 -177.6398 -171.26548
[2,] -171.5399 -166.1692 -160.7986 -115.2865 -69.77446
[3,] -161.1759 -157.6137 -149.7103 -124.6893 -69.77446
[4,] -194.1055 -179.8316 -175.1434 -168.4610 -159.07206
[5,] -200.1544 -174.2835 -167.7209 -145.3432 -129.54586
I can then access each individual quantile’s values by doing
> allquartile$score[,1]
[1] -196.7627 -171.5399 -161.1759 -194.1055 -200.1544
I’m not familiar enough with R to know what kind of data structure this is, but I would call it a matrix. So, like any good matrix object, m[,column] returns the values of the column, m[row,] returns the values of the row, and m[row, column] gets the cell value.
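As a rough illustration of that structure, here is a small sketch using made-up data. The values are invented and are not my real topicquality data; only the shape is meant to match.

```r
# Made-up stand-in for topicquality (illustrative values only)
set.seed(1)
topicquality <- data.frame(
  T      = rep(2:5, each = 10),
  method = "LDA",
  score  = rnorm(40, mean = -170, sd = 20)
)

# Same kind of call as in my edit: quantile() returns five values per group
allquartile <- aggregate(
  list(score = topicquality$score),
  list(T = topicquality$T, method = topicquality$method),
  FUN = quantile, probs = seq(0, 1, 0.25)
)

colnames(allquartile)         # "T" "method" "score"
is.matrix(allquartile$score)  # TRUE: one matrix stored as a single column
dim(allquartile$score)        # 4 rows (one per T), 5 quantile columns
colnames(allquartile$score)   # "0%" "25%" "50%" "75%" "100%"

allquartile$score[, 1]        # column 1: the 0% quantile of every group
allquartile$score[1, ]        # row 1: all five quantiles of the first group
allquartile$score[1, 3]       # single cell: the median of the first group
```

The quantile names live on the matrix's own columns, not on the data frame, which is why a name like `s.0%` cannot be found when plotting.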
With that in mind, I’ve realized that the proper plotting command should be
p <- ggplot(allquartile,
aes(x=T,
ymin=score[,1],
lower=score[,2],
middle=score[,3],
upper=score[,4],
ymax=score[,5],
color=method)) +
geom_boxplot(stat="identity")
And this plots out everything perfectly.
Thanks to everyone for the good suggestions. Even though they didn’t fix the problem, they helped a lot in figuring things out.
Actually, based on your edits, I think your real problem is that you shouldn’t have been using aggregate. If the function you are applying returns multiple values (like quantile), aggregate returns the results in the somewhat inconvenient format you observed, by default.
What’s happening is this. A data frame, somewhat confusingly, is actually a list, with each column being an element of the list. The only requirement is that each ‘column’ has the same number of rows. So you’re getting a data frame back with three ‘columns’: the third column is just a matrix!
Doing this with aggregate is possible, but there are more convenient tools out there. (For instance, you could call cbind(allquartile[,1:2],allquartile[,3]) to create a data frame of the ‘correct’ dimensions.)
For example, a very popular one is ddply from the plyr package. Here’s an example using some made up data, but following the general structure of your data:
You’ll note that this will return a data frame of the dimensions you expect, but you still have to deal with the inconvenient column names. That’s best dealt with in the function you apply to each piece:
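A minimal sketch of that idea, assuming the plyr package is installed. The data are made up, and the helper name quantile_cols and the column names q0 to q100 are hypothetical choices for illustration only:

```r
library(plyr)  # assumption: plyr is installed

# Made-up data following the general structure of the question's data
set.seed(42)
topicquality <- data.frame(
  T      = rep(2:5, each = 10),
  method = "LDA",
  score  = rnorm(40, mean = -170, sd = 20)
)

# Hypothetical helper: compute the five quantiles for one piece and give
# them syntactically valid names instead of "0%", "25%", ...
quantile_cols <- function(piece) {
  q <- quantile(piece$score, probs = seq(0, 1, 0.25))
  names(q) <- c("q0", "q25", "q50", "q75", "q100")
  q
}

# One row per (T, method) combination, one ordinary column per quantile
allquartile <- ddply(topicquality, .(T, method), quantile_cols)

names(allquartile)  # "T" "method" "q0" "q25" "q50" "q75" "q100"
```

With plain column names like these, the quantiles can be mapped in aes() directly (for example ymin=q0, lower=q25) without backticks or matrix indexing.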