I’ve got an 7.4Gb csv file. After converting it to a sqlite database with a python script the output DB is 4.7Gb, around 60% of the original size.
The csv has around 150,000,000 rows. It has header:
tkey,ipaddr,healthtime,numconnections,policystatus,activityflag
And each row looks something like
261846,172.10.28.15,2012-02-03 16:15:00,22,1,1
The script uses healthtime to split the data into tables 192 tables
When I first saw these numbers, I assumed I had made an error someplace. How big of a reduction in file size should I expect from the added efficiency of only writing the healthtime 192 times instead of 150,000,000 times?
EDIT: As soon as I posted this I realized the answer. I’m removing about 40% of the string, hence the 40% reduction in size.
Edit 2 Let’s calculate the difference in size between the plain text:
"261846,172.10.28.15,2012-02-03 16:15:00,22,1,1"
And the data base entry:
db(261846,'172.10.28.15',22,1,1)
First of all, we drop from 46 to 26 characters in plain text representation.
The remaining characters are:
"261846,172.10.28.15,22,1,1"
or 26 bytes. If each integer needs to be stored in 32 bit (4 bytes), then we have:
12 bytes (ipaddr) + 4 bytes * 4 (integer fields) = 28 bytes.
So it looks like converting to integers makes the storage slightly less efficient, and all of my gains come form reducing the number of characters stored in each row.
SQLite is not running a compression algorithm, but it will store data in a binary file instead of a text file. Which means that the data can be stored more efficiently, for example using a 32-bit (4 byte) number to represent
10,000,000instead of storing it as 8 bytes of text (or more if the file is unicode).Here are more details on the SQL Database File Format if you are interested.
Does that make sense?