I need to process large JSON files, instantiating objects from deserializable substrings as the file is streamed in.
For example:
Let’s say I can only deserialize into instances of the following:
case class Data(a: Int, b: Int, c: Int)
and the expected JSON format is:
{ "foo": [ {"a": 0, "b": 0, "c": 0 }, {"a": 0, "b": 0, "c": 1 } ],
"bar": [ {"a": 1, "b": 0, "c": 0 }, {"a": 1, "b": 0, "c": 1 } ],
.... MANY ITEMS .... ,
"qux": [ {"a": 0, "b": 0, "c": 0 } ] }
What I would like to do is:
import com.codahale.jerkson.Json
val dataSeq : Seq[Data] = Json.advanceToValue("foo").stream[Data](fileStream)
// NOTE: this will not compile since I pulled the "advanceToValue" out of thin air.
As a final note, I would prefer a solution that uses Jerkson or another library that ships with the Play framework, but if a different Scala library handles this scenario more easily and with decent performance, I'm not opposed to trying it. If there is a clean way to seek through the file manually and then hand off to a JSON library to continue parsing from that point, I'm fine with that too.
What I do not want to do is ingest the entire file without streaming or using an iterator, as keeping the entire file in memory at once would be prohibitively expensive.
Here is the current way I am solving the problem:
Granted, this code doesn’t handle malformed JSON very cleanly, and using it for multiple top-level keys (“foo”, “bar” and “qux”) will require looking ahead (or matching against a list of possible top-level keys), but in general I believe it does the job. It’s not quite as functional as I’d like and isn’t super robust, but PagedSeqReader definitely keeps it from getting too messy.
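For comparison, here is a rough sketch of the "seek to a top-level key, then deserialize element by element" idea using Jackson's streaming parser, the library Jerkson is built on. It is not the PagedSeqReader code described above. It assumes Jackson 2.x (the com.fasterxml.jackson packages, jackson-databind) is on the classpath, and the names StreamFoo and streamArray are made up for illustration. To avoid depending on jackson-module-scala, each element is read as a tree node and converted to Data by hand.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import com.fasterxml.jackson.core.{JsonParser, JsonToken}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

case class Data(a: Int, b: Int, c: Int)

// Hypothetical helper, not part of Jerkson or Play.
object StreamFoo {
  private val mapper = new ObjectMapper()

  /** Lazily yields the elements of the array stored under the top-level `key`. */
  def streamArray(in: InputStream, key: String): Iterator[Data] = {
    val parser: JsonParser = mapper.getFactory.createParser(in)
    require(parser.nextToken() == JsonToken.START_OBJECT, "expected a top-level JSON object")

    // Walk the top-level fields until `key` is found, skipping other values
    // without building them in memory.
    var found = false
    while (!found && parser.nextToken() == JsonToken.FIELD_NAME) {
      val name = parser.getCurrentName
      parser.nextToken()          // move onto the field's value
      if (name == key) found = true
      else parser.skipChildren()  // jump to the end of this array/object
    }

    if (!found || parser.getCurrentToken != JsonToken.START_ARRAY) {
      parser.close()
      Iterator.empty
    } else {
      // Each nextToken() lands on the START_OBJECT of the next element,
      // or on END_ARRAY, which ends the iterator.
      // Note: the parser (and the underlying stream) is not closed here;
      // the caller should close `in` once the iterator is exhausted.
      Iterator
        .continually(parser.nextToken())
        .takeWhile(_ == JsonToken.START_OBJECT)
        .map { _ =>
          val node = parser.readValueAsTree[JsonNode]()
          Data(node.get("a").asInt, node.get("b").asInt, node.get("c").asInt)
        }
    }
  }
}
```

A call such as StreamFoo.streamArray(fileStream, "foo") returns an Iterator[Data] that holds only one element in memory at a time, provided the result is consumed with foreach or similar rather than collected with toSeq or toList. As with the approach above, a missing field inside an element or malformed JSON will surface as an exception partway through iteration.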