I am importing millions of rows from a tab file and SQLite .import .mode tabs is very slow. I have three indexes so probably the slowness comes from the indexing. But first I would like to check that .import add the rows grouping lots/all of them into a single commit. I was unable to find documentation how .import works. Do someone knows?. If the index is the problem (I had that issue before with mysql) how can I disable it and reindex at the end of the .import?
[Update 1]
Following @sixfeetsix comment.
My schema is:
CREATE TABLE ensembl_vf_b36 (
variation_name varchar(20),
chr varchar(4),
start integer,
end integer,
strand varchar(5),
allele_string varchar(3),
map_weight varchar(2),
flags varchar(50),
validation_status varchar(100),
consequence_type varchar(50)
);
CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);
data:
rs35701516 NT_113875 352 352 1 G/A 2 NULL NULL INTERGENIC
rs12090193 NT_113875 566 566 1 G/A 2 NULL NULL INTERGENIC
rs35448845 NT_113875 758 758 1 A/C 2 NULL NULL INTERGENIC
rs17274850 NT_113875 1758 1758 1 G/A 2 genotyped cluster,freq INTERGENIC
There are 15_608_032 entries in this table
And these are the stats
$ time sqlite3 -separator ' ' test_import.db '.import variations_build_36_ens-54.tab ensembl_vf_b36'
real 29m27.643s
user 4m14.176s
sys 0m15.204s
[Update 2]
@sixfeetsix has a good answer and if you are reading this, you would be also interested in
Faster bulk inserts in sqlite3?
Sqlite3: Disabling primary key index while inserting?
[update3] Solution from 30 min -> 4 min
Even with all the optimisations (see accepted answer) still takes almost 30 minutes but if the indexes are not used and added at the end then total time is 4 minutes:
-- importing without indexes:
real 2m22.274s
user 1m38.836s
sys 0m4.850s
-- adding indexes
$ time sqlite3 ensembl-test-b36.db < add_indexes-b36.sql
real 2m18.344s
user 1m26.264s
sys 0m6.422s
I believe the slowness indeed comes from building the index as more and more records are added. Depending on the RAM you have, you can tell sqlite to use enough memory so that all this index building activity is done in memory (ie without all the I/O that would happen otherwise with less memory).
For 15M records, I’d say you should set your cache size at 500000.
You can also tell sqlite to keep its transaction journal in memory.
Finally, you can set synchronous to OFF so sqlite never waits for writes to be committed to disk.
Using this I was able to divide the time it takes to import 15M records by 5 (14 minutes down to 2.5) with records made of random GUIDs split in 5 columns, using the three middle columns as an index:
So to try this I suggest you put all the instructions in some file, say import_test:
Then try it:
EDIT
This is a reply to the Pablo’s (the OP) comments following this answer (it’s to long to fit as a comment):
My (educated) guesses are that:
it hasn’t much ado with
transactions, I’m even inclined to
think that it is written to go
faster than even if you had all this
done in one “normal” transaction;
and,
allocate, and you set up your
environment as I suggest, the real
(time) hog here is reading the flat
file, then writing the final content
of the database, because what
happens in between happens extremely
fast; i.e. fast enough that there
ain’t much time to gain by optimizing it
when you compare such potential
gains with the (probably) non-compressible time spent on disk I/O.
If I’m wrong though I’d be glad to hear why for my own benefit.
EDIT 2
I did a comparison test between having the index in place during .import, and having it added immediately after .import finishes. I used the same technique of generating a 15M record made of split random UUIDs:
Then I tested importing with the index created before and after (here the index is created after):
So here is the time when adding the index before:
And when the index is added after:
You see how the “user” times difference (~9s) don’t account for the full times difference (~40s)? I To me this means that there is some extra I/O happening when the index is created before, and so I was wrong to think that all was being done in memory with no extra I/O.
Conclusion: create the index after and you’ll have even better import times (just as Donal mentioned).