I have a CSV file with timestamps and the event types that occurred at those times.
What I want is to count the number of occurrences of each event type in 6-minute intervals.
The input data looks like this:
date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
I load and cure the data with this piece of code:
> raw_data <- read.csv('input.csv')
> cured_dates <- c(strptime(raw_data$date, '%b %d, %Y %H:%M:%S', tz="CEST"))
> cured_data <- data.frame(cured_dates, c(raw_data$type))
> colnames(cured_data) <- c('date', 'type')
After curing, the data looks like this:
> head(cured_data)
date type
1 2011-09-22 14:54:53 2
2 2011-09-22 14:54:53 2
3 2011-09-22 14:54:53 2
4 2011-09-22 14:54:53 2
5 2011-09-22 14:54:53 1
6 2011-09-22 14:54:53 1
I have read a lot of examples for xts and zoo, but somehow I can't get the hang of them.
The output data should look something like:
date type count
2011-09-22 14:54:00 CEST 1 11
2011-09-22 14:54:00 CEST 2 19
2011-09-22 15:00:00 CEST 1 9
2011-09-22 15:00:00 CEST 2 12
2011-09-22 15:06:00 CEST 1 23
2011-09-22 15:06:00 CEST 2 18
zoo's aggregate function looks promising. I found this code snippet:
# aggregate POSIXct seconds data every 10 minutes
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(x, time(x) - as.numeric(time(x)) %% 600, mean)
Now I'm just wondering how I could apply this to my use case.
Naive as I am, I tried:
> zoo_data <- zoo(cured_data$type, structure(cured_data$time, class = c("POSIXt", "POSIXct")))
> aggr_data = aggregate(zoo_data$type, time(zoo_data$time), - as.numeric(time(zoo_data$time)) %% 360, count)
Error in `$.zoo`(zoo_data, type) : not possible for univariate zoo series
I must admit that I’m not really confident in R, but I try. 🙂
I'm kinda lost. Could anyone point me in the right direction?
Thanks a lot!
Cheers, Alex.
Here is the output of dput for a small subset of my data. The full data set is around 80 million rows.
structure(list(date = structure(c(1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885), class = c("POSIXct", "POSIXt"), tzone = ""),
type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L)), .Names = c("date",
"type"), row.names = c(NA, -23L), class = "data.frame")
We can read it using read.csv, convert the first column to a date time binned into 6 minute intervals and add a dummy column of 1's. Then re-read it using read.zoo, splitting on the type and aggregating on the dummy column:
With the above test data the solution looks like this:
Note that the above has been done in wide form since that form constitutes a time series whereas the long form does not. There is one column for each type. In our test data we had types 2, 3 and 4 so there are three columns.
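The binning-and-counting idea can also be sketched in base R alone. This is not the read.zoo/chron code described above: it is a minimal illustration that floors POSIXct timestamps to 6-minute bins with the same modulo trick as the snippet quoted in the question, and then counts events per bin and type with table. The data frame `events` and its timestamps are hypothetical test data, and UTC is assumed as the time zone.

```r
# Hypothetical test data: five events with made-up timestamps and types.
start <- as.POSIXct("2011-09-22 12:54:53", tz = "UTC")
events <- data.frame(
  date = start + c(0, 17, 306, 308, 637),  # offsets in seconds
  type = c(2L, 1L, 2L, 1L, 1L)
)

# Floor each timestamp to the start of its 6-minute (360 s) bin by
# subtracting the remainder of the epoch seconds modulo 360.
bin <- events$date - as.numeric(events$date) %% 360

# Wide form: one row per bin, one column per type.
counts <- table(
  bin  = format(bin, "%Y-%m-%d %H:%M:%S"),
  type = events$type
)
print(counts)

# Long form (bin, type, count), closer to the layout requested in the
# question; combinations that never occur are dropped.
long <- as.data.frame(counts, responseName = "count")
long <- long[long$count > 0, ]
print(long)
```

Because the bins are aligned to the Unix epoch, they start at :00, :06, :12 and so on in any time zone whose UTC offset is a multiple of six minutes.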
(We have used chron here since its trunc method fits well with binning into 6 minute groups. chron does not support time zones, which can be an advantage since you can't make one of the many possible time zone errors, but if you want POSIXct anyway, convert it at the end, e.g. time(z) <- as.POSIXct(paste(as.Date.dates(time(z)), times(time(z)) %% 1)). This expression is shown in a table in one of the R News 4/1 articles, except we used as.Date.dates instead of just as.Date to work around a bug that seems to have been introduced since then. We could also use time(z) <- as.POSIXct(time(z)), but that would result in a different time zone.)
EDIT:
The original solution binned into dates but I noticed afterwards that you wish to bin into 6 minute periods so the solution was revised.
EDIT:
Revised based on comment.