Is it possible to iterative over a single text file on a single multi-core machine in parallel with R? For context, the text file is somewhere between 250-400MB of JSON output.
EDIT:
Here are some code samples I have been playing around with. To my surprise, parallel processing did not win – just basic lapply – but this could be due to user error on my part. In addition, when trying to read a number of large files, my machine choked.
## test on first 100 rows of 1 twitter file
library(rjson)
library(parallel)
library(foreach)
library(plyr)
N = 100
library(rbenchmark)
mc.cores <- detectCores()
benchmark(lapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
llply(readLines(FILE, n=N, warn=FALSE), fromJSON),
mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON,
mc.cores=mc.cores),
foreach(x=readLines(FILE, n=N, warn=FALSE)) %do% fromJSON(x),
replications=100)
Here is a second code sample
parseData <- function(x) {
x <- tryCatch(fromJSON(x),
error=function(e) return(list())
)
## need to do a test to see if valid data, if so ,save out the files
if (!is.null(x$id_str)) {
x$created_at <- strptime(x$created_at,"%a %b %e %H:%M:%S %z %Y")
fname <- paste("rdata/",
format(x$created_at, "%m"),
format(x$created_at, "%d"),
format(x$created_at, "%Y"),
"_",
x$id_str,
sep="")
saveRDS(x, fname)
rm(x, fname)
gc(verbose=FALSE)
}
}
t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))
The answer depends on what the problem actually is: reading the file in parallel, or processing the file in parallel.
Reading in parallel
You could split the JSON file into multiple input files and read them in parallel, e.g. using the
plyrfunctions combined with a parallel backend:Registering a backend can probably be done using the
parallelpackage which is now integrated in base R. Or you can use thedoSNOWpackage, see this post on my blog for details.Processing in parallel
In this scenario your best bet is to read the entire dataset into a vector of characters, split the data and then use a parallel backend combined with e.g. the
plyrfunctions.