Is it possible to iterate over a single text file on a single multi-core

Question

Asked: June 15, 2026 (2026-06-15T02:01:44+00:00)

Is it possible to iterate over a single text file on a single multi-core machine in parallel with R? For context, the text file is somewhere between 250 and 400 MB of JSON output.

EDIT:

Here are some code samples I have been playing around with. To my surprise, none of the parallel versions won; plain lapply was fastest, but this could be due to user error on my part. In addition, when I tried to read a number of large files, my machine choked.

## test on first 100 rows of 1 twitter file
## FILE is assumed to hold the path to a file with one JSON record per line
library(rjson)
library(parallel)
library(foreach)
library(plyr)
library(rbenchmark)
N <- 100
mc.cores <- detectCores()
benchmark(lapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          llply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          ## mclapply defaults to getOption("mc.cores", 2L) when mc.cores is not given
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON,
                   mc.cores=mc.cores),
          ## %do% runs sequentially; %dopar% with a registered backend would run in parallel
          foreach(x=readLines(FILE, n=N, warn=FALSE)) %do% fromJSON(x),
          replications=100)

Here is a second code sample:

parseData <- function(x) {
  x <- tryCatch(fromJSON(x), 
                error=function(e) return(list())
                )
  ## check that the record is valid; if so, save it out to a file
  if (!is.null(x$id_str)) {
    x$created_at <- strptime(x$created_at,"%a %b %e %H:%M:%S %z %Y")
    fname <- paste("rdata/",
                   format(x$created_at, "%m"),
                   format(x$created_at, "%d"),
                   format(x$created_at, "%Y"),
                   "_",
                   x$id_str,
                   sep="")
    saveRDS(x, fname)
    rm(x, fname)
    gc(verbose=FALSE)
  }
}

t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))

1 Answer

Editorial Team · Answer 1 · 2026-06-15T02:01:46+00:00

The answer depends on what the problem actually is: reading the file in parallel, or processing the file in parallel.

Reading in parallel

You could split the JSON file into multiple input files and read them in parallel, e.g. using the plyr functions combined with a parallel backend:

## readJSON is a placeholder for your own function that reads and parses one file
result <- ldply(list.files(pattern = "\\.json$"), readJSON, .parallel = TRUE)

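With .parallel = TRUE, plyr runs the calls through foreach, so a foreach backend has to be registered first. This can be done with the doParallel package, which builds on the parallel package that now ships with R, or with the doSNOW package; see this post on my blog for details.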
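A minimal sketch of the reading step, assuming the large file has already been split into part files. It swaps in a socket cluster from the parallel package (makeCluster and parLapply) in place of plyr with a foreach backend, so it runs without extra packages. The function name read_files_parallel, the part file names, and the demo data are made up for illustration.

```r
library(parallel)  # ships with R

# Read several files in parallel; returns one character vector per file.
read_files_parallel <- function(files, workers = 2L) {
  cl <- makeCluster(workers)   # socket (PSOCK) cluster, also works on Windows
  on.exit(stopCluster(cl))     # always shut the workers down
  parLapply(cl, files, readLines, warn = FALSE)
}

# Demo: write ten JSON lines into two hypothetical part files, then read them back.
part_dir <- tempfile("parts")
dir.create(part_dir)
all_lines <- sprintf('{"id_str": "%d"}', 1:10)
writeLines(all_lines[1:5],  file.path(part_dir, "part1.json"))
writeLines(all_lines[6:10], file.path(part_dir, "part2.json"))

files <- list.files(part_dir, pattern = "\\.json$", full.names = TRUE)
parts <- read_files_parallel(files)
```

Whether this beats a plain sequential read depends on the disk: if all part files sit on the same drive, the workers may just compete for I/O, so it is worth benchmarking before relying on it.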

Processing in parallel

In this scenario your best bet is to read the entire dataset into a character vector (for example with readLines), split it into chunks, and then process the chunks with a parallel backend, e.g. combined with the plyr functions.
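The read, split, and process steps could look roughly like the sketch below. It uses mclapply from the parallel package instead of plyr, so no backend has to be registered. The names parse_in_chunks and extract_id are hypothetical, and extract_id is only a stand-in parser so that the example runs without extra packages; with rjson installed, fromJSON would take its place.

```r
library(parallel)  # ships with R

# Parse a character vector of lines in parallel, one contiguous chunk per core.
parse_in_chunks <- function(lines, parse_fun, cores = 2L) {
  if (length(lines) == 0L) return(list())
  # mclapply relies on forking, which Windows lacks; fall back to one core there.
  if (.Platform$OS.type == "windows") cores <- 1L
  # Assign every line to one of `cores` contiguous chunks.
  chunk_id <- ceiling(seq_along(lines) * cores / length(lines))
  chunks <- split(lines, chunk_id)
  parsed <- mclapply(chunks, function(chunk) lapply(chunk, parse_fun),
                     mc.cores = cores)
  # Flatten the per-chunk lists into one list, keeping the original order.
  unlist(parsed, recursive = FALSE, use.names = FALSE)
}

# Demo on a small temporary file with one JSON record per line.
tmp <- tempfile(fileext = ".json")
writeLines(sprintf('{"id_str": "%d"}', 1:10), tmp)
lines <- readLines(tmp, warn = FALSE)

# Stand-in parser (assumption): pulls out the id_str value with a regular expression.
extract_id <- function(x) sub('.*"id_str": "([0-9]+)".*', "\\1", x)
ids <- parse_in_chunks(lines, extract_id, cores = 2L)
```

Handing each worker one large chunk keeps the per-task overhead low, which matters when parsing a single record is cheap. On a 100-line sample that overhead can easily outweigh any gain, which may explain why plain lapply came out ahead in the benchmark in the question.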
