Edit: this question is outdated. The jsonlite package flattens automatically.
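For readers arriving now, here is a minimal sketch of what that looks like, assuming the jsonlite package is installed. The two records are made up for illustration and only loosely modelled on the GPS data in the question.

```r
library(jsonlite)

# Hypothetical records; the second one has no "l" object (no GPS signal).
txt <- '[
  {"ts": 1, "l": {"t": 1357000000000, "lat": 52.1, "lon": 4.3}},
  {"ts": 2}
]'

# flatten = TRUE spreads the nested "l" object over columns named l.t, l.lat, l.lon.
df <- fromJSON(txt, flatten = TRUE)
print(names(df))
```

Note that the columns are still inferred from the data, so a property that is absent from every record will not show up as a column here either.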
I am dealing with online data streams that use record-based encoding, usually JSON. The structure of the objects (i.e. the names in the JSON) is known from the API documentation; however, values are mostly optional and not present in every record. Lists can contain further lists, and the structure is sometimes quite deep. Here is a fairly simple example of some GPS data: http://pastebin.com/raw.php?i=yz6z9t25. Note that in the lower rows the "l" object is missing because there was no GPS signal.
I am looking for an elegant way to flatten these objects into a dataframe. I am currently using something like this:
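library(RJSONIO)
library(plyr)
# Parse without simplification so that every record stays a nested list
obj <- fromJSON("http://pastebin.com/raw.php?i=yz6z9t25", simplifyWithNames=FALSE, simplify=FALSE)
# Turn each record into a one-row data frame; nested names become l.t, l.lat, ...
flatdata <- lapply(obj$data, as.data.frame)
# Stack the rows, filling columns that a record lacks with NA
mydf <- rbind.fill(flatdata)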
This does the job, but it is slow and somewhat error-prone. One problem with this approach is that it does not use what I know about the structure (the object names); instead, the structure is inferred from the data. That causes trouble when a certain property happens to be absent from every record: it will not appear in the dataframe at all, rather than showing up as a column of NA values. This can lead to issues downstream. For example, I need to process the location timestamp:
mydf$l.t <- structure(mydf$l.t/1000, class="POSIXct")
However, this results in an error for a dataset in which the l$t object is never present. Furthermore, both as.data.frame and rbind.fill make things quite slow, even though the example dataset is a relatively small one. Any suggestions for a better implementation? A robust solution would always yield a dataframe with the same columns in the same order, where only the number of rows varies.
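To make that requirement concrete, here is a rough base-R sketch of the behaviour I am after. It is not a tuned implementation. The schema, the column names and the helpers get_path and flatten_records are all hypothetical, and it assumes every leaf is a single numeric value.

```r
# Hypothetical schema: flattened column name -> typed NA used when a value is missing.
schema <- list(ts = NA_real_, l.t = NA_real_, l.lat = NA_real_, l.lon = NA_real_)

# Walk a nested list along a path such as c("l", "t"); NULL if any step is missing.
get_path <- function(rec, path) {
  for (key in path) {
    if (!is.list(rec) || is.null(rec[[key]])) return(NULL)
    rec <- rec[[key]]
  }
  rec
}

# Build one column per schema entry, so columns and their order never depend on the data.
flatten_records <- function(records, schema) {
  cols <- lapply(names(schema), function(col) {
    path <- strsplit(col, ".", fixed = TRUE)[[1]]
    na <- schema[[col]]
    vapply(records, function(rec) {
      val <- get_path(rec, path)
      if (is.null(val)) na else val
    }, FUN.VALUE = na, USE.NAMES = FALSE)
  })
  names(cols) <- names(schema)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up records in the shape produced by fromJSON(..., simplify = FALSE).
records <- list(
  list(ts = 1, l = list(t = 1357000000000, lat = 52.1, lon = 4.3)),
  list(ts = 2)  # no GPS signal, so no "l" object
)
mydf <- flatten_records(records, schema)

# The timestamp conversion now works even when no record carries l$t.
mydf$l.t <- as.POSIXct(mydf$l.t / 1000, origin = "1970-01-01", tz = "UTC")
```

With an empty list of records this still returns a dataframe with the same four columns and zero rows, which is the property I want.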
Edit: below is a dataset with more metadata. It is larger and more deeply nested:
obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)
Just for clarity, I am adding a combination of Josh's and Joshua’s solutions, which is the best I have come up with so far.
The function is reasonably fast. I still think it should be possible to speed it up, though:
It also allows you to ‘force’ certain columns, although that doesn’t result in much of a speedup: