I have a dataset that contains observations for every second of four consecutive days (roughly 340’000 data points). This is too much to display in a scatter plot. I would like to plot only a uniform sample of, say, 2000 time points.
Is it possible to achieve this with ggplot2’s “grammar of graphics” approach? I haven’t found any built-in “sampling” modifier, but perhaps it’s easy enough to write one?
library(ggplot2)
x <- 1:100000
d <- data.frame(x=x, y=rnorm(length(x)))
ggplot(d[sample(x, 2000), ], aes(x=x, y=y)) + geom_point()
This is how it can be “hacked” by modifying the data passed to ggplot. But I don’t want to modify the data, just filter it to include only a sample.
ggplot(d, aes(x=x, y=y)) + ??? + geom_point()
EDIT: I’m specifically looking for sampling, not smoothing or binning. The data I have shows the time it takes to simulate one second of a specific process. The simulation has been parallelized, and for each simulated second I have the run times for each of the cores involved (8 in total). I want to show sub-optimal load balancing by plotting just the raw data points. The reason for the sampling is just that 300’000 data points are way too many for a scatter plot: plotting takes too long and the visualization is no good.
You can subset within the geom_point call using the data argument. This way, the plot's default data stays untouched and you are free to add other geoms that use all the data.
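As a minimal sketch of the above, reusing the example data from the question: only the point layer receives the 2000-row sample, while the plot itself keeps the full data frame. The fixed seed, the names idx, p, p_all and p_fn, and the choice of geom_smooth as the "other geom" are assumptions made for illustration, not part of the original answer.

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the sample is reproducible
x <- 1:100000
d <- data.frame(x = x, y = rnorm(length(x)))

# Draw 2000 row indices uniformly, without replacement
idx <- sample(nrow(d), 2000)

# Only the point layer sees the sample; the plot's default data is still d
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = d[idx, ])

# Layers added without a data argument use all rows of d
# (geom_smooth is just an illustrative choice here)
p_all <- p + geom_smooth()

# Alternative: pass a function as the layer data; ggplot2 calls it with
# the plot data, so d does not have to be named a second time
p_fn <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = function(df) df[sample(nrow(df), 2000), ])
```

Printing p, p_all or p_fn renders the plot. The function form comes closest to the "sampling modifier" asked for in the question, since the sample is taken from whatever data the plot was created with.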