I have a query that returns around 6 million rows, which is too many to process in memory all at once.
Each row comes back as a Tuple3[String, Int, java.sql.Timestamp]. I know the string is never more than about 20 characters, UTF-8.
How can I work out the maximum size of one of these tuples and, more generally, how can I approximate the size of a Scala data structure like this?
I’ve got 6 GB on the machine I’m using. However, the data is being read from the database using scala-query into Scala’s Lists.
Scala objects follow approximately the same rules as Java objects, so any information about Java object sizes applies. Here is one source, which seems at least mostly right for 32 bit JVMs. (64 bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes extra overhead plus 4 bytes extra per pointer, but it may be less if the JVM is using compressed pointers, which it does by default now, I think.)
I’ll assume a 64 bit machine without compressed pointers (worst case); then:
Tuple3 has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes), rounded to the nearest 8, or 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples they take even more space than when you use wrapped versions.)
String is 32 bytes, IIRC, plus the array for the data, which is 16 plus 2 per character.
java.sql.Timestamp needs to store a couple of Longs (I think it is), so that’s 32 bytes.
All told, it’s on the order of 120 bytes plus two per character, which at ~20 characters is ~160 bytes.
Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (and my estimate above has been corrected using this data so it matches; I had several small errors before).
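To make the arithmetic above easy to re-run with a different string length or row count, here is a minimal sketch of the estimate in Scala. The object and method names (TupleSizeEstimate, align8, rowBytes and so on) are made up for illustration, and the constants are just the rough figures quoted above for a 64 bit JVM without compressed pointers, not measured values.

```scala
object TupleSizeEstimate {
  // Round up to the next multiple of 8 (object alignment).
  def align8(bytes: Long): Long = (bytes + 7) / 8 * 8

  // Rough figures taken from the estimate above (assumptions, not measurements).
  val PointerBytes = 8L         // 64 bit, no compressed pointers
  val ObjectOverheadBytes = 12L // approximate per-object overhead
  val IntBytes = 4L

  // Tuple3: two pointers + an Int + overhead, aligned, plus the 8-byte stub.
  val tuple3Bytes: Long =
    align8(2 * PointerBytes + IntBytes + ObjectOverheadBytes) + 8

  // String: 32 bytes for the object, plus a char array of 16 + 2 per character.
  def stringBytes(chars: Int): Long = 32 + 16 + 2L * chars

  // Timestamp: about 32 bytes.
  val timestampBytes: Long = 32L

  // One (String, Int, Timestamp) row with a string of the given length.
  def rowBytes(chars: Int): Long =
    tuple3Bytes + stringBytes(chars) + timestampBytes

  def main(args: Array[String]): Unit = {
    val perRow = rowBytes(20)
    val rows = 6000000L
    val total = perRow * rows
    println(s"per row: $perRow bytes")
    println(s"for $rows rows: $total bytes (about ${total / (1024 * 1024)} MiB)")
  }
}
```

With 20 characters this gives 160 bytes per row, so 6 million rows come to roughly 960 million bytes for the tuples alone. The List holding them adds its own per-element overhead, which this sketch does not count.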