I want to count a very large number of strings (more than 3G), so I chose SQLite with a table of (str TEXT PRIMARY KEY, count INTEGER DEFAULT 1).
There are about 3G strings, and each takes 40*2/8 = 10 bytes, so the strings total about 30 GB.
Ten bytes allow 2^80 distinct values, which is far more than 3G.
So how can I update the counts efficiently?
UPDATE mytable SET count = count + 1 WHERE str = 'xxx';
-- if no rows were affected, insert the string instead:
INSERT INTO mytable (str) VALUES ('yyy');
Or something like INSERT OR REPLACE, which I am not familiar with.
Any suggestions?
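For illustration, the update-then-insert idea above could be sketched as follows with Python's standard sqlite3 module (the same SQL should work from Perl's DBI). The table name mytable and the helper bump are assumptions made for the example, since TABLE itself is a reserved word in SQLite; this is a minimal sketch, not a tuned solution for 3G rows.

```python
import sqlite3

def bump(conn, s):
    """Increment the count for s, inserting the row the first time s is seen."""
    cur = conn.execute(
        "UPDATE mytable SET count = count + 1 WHERE str = ?", (s,)
    )
    if cur.rowcount == 0:  # no row matched, so this string is new
        conn.execute("INSERT INTO mytable (str) VALUES (?)", (s,))

# In-memory database for the example; a real run would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (str TEXT PRIMARY KEY, count INTEGER DEFAULT 1)"
)

with conn:  # one transaction for the whole batch, committed on exit
    for s in ["xxx", "yyy", "xxx", "xxx"]:
        bump(conn, s)

result = dict(conn.execute("SELECT str, count FROM mytable"))
print(result)
```

Note that a plain INSERT OR REPLACE INTO mytable (str) VALUES ('xxx') would replace the existing row, so the count would fall back to its default of 1 instead of increasing.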
I followed Sinan Ünür’s approach:
PRAGMA synchronous = OFF;
PRAGMA journal_mode = OFF;
PRAGMA temp_store = MEMORY;
PRAGMA auto_vacuum = NONE;
PRAGMA cache_size = 4000000;
CREATE TABLE kmers ( seq TEXT );
SELECT seq,COUNT(seq) FROM kmers GROUP BY seq;
No index is used, and AutoCommit is 0.
I have not tested whether journal_mode = OFF is faster.
The temp_store setting is probably useless here.
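As a rough sketch of that setup end to end, the snippet below applies the same PRAGMAs, loads a few rows in a single transaction, and runs the GROUP BY query. It uses Python's sqlite3 module rather than Perl's DBI; the in-memory database and the sample sequences are assumptions for the example only.

```python
import sqlite3

# Assumption: in-memory database for the demo; a real run would use a file.
conn = sqlite3.connect(":memory:")

# The same settings as listed above, applied before any data is written.
for pragma in (
    "PRAGMA synchronous = OFF",
    "PRAGMA journal_mode = OFF",
    "PRAGMA temp_store = MEMORY",
    "PRAGMA auto_vacuum = NONE",
    "PRAGMA cache_size = 4000000",
):
    conn.execute(pragma)

conn.execute("CREATE TABLE kmers ( seq TEXT )")

# Hypothetical sample data standing in for the real k-mers.
sample = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]

with conn:  # a single transaction, i.e. no per-row autocommit
    conn.executemany(
        "INSERT INTO kmers (seq) VALUES (?)", ((s,) for s in sample)
    )

# Each result row is (seq, count), so dict() gives a seq -> count mapping.
kmer_counts = dict(
    conn.execute("SELECT seq, COUNT(seq) FROM kmers GROUP BY seq")
)
print(kmer_counts)
```

With synchronous and the journal both off, a crash during loading can leave the database file corrupted, so this only makes sense when the data can be reloaded from the source.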
This is really not a Perl question but a SQL question. In any case, you do not need the COUNT column, as SQLite provides a built-in count function to do the counting for you:
SELECT str, count(str) FROM mytable GROUP BY str
should give you each unique str and the number of times it appears in the table.
Of course, if you defined your table with str as the primary key, you cannot insert multiple copies of the same str by definition, so your table structure needs to be refined.
UPDATE:
If I were to do this (and I am not sure I would), I would set up a table with an autogenerated id column and a column for the string. SQLite’s INTEGER PRIMARY KEY, a 64-bit integer, would be sufficient to assign a unique id to each string inserted.
Then, I would use the query above to get the frequencies by string.
If you are inserting via Perl’s DBI, make sure to turn AutoCommit off during insertion and remember to commit at the end (or periodically).
Creating an index seems almost mandatory, but it should be done after all the strings are in the database and before any queries are run.
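A minimal sketch of that loading order, written with Python's sqlite3 module instead of Perl's DBI: insert in batches with one commit per batch, create the index only after loading, then query. The table name strings, the index name, the load helper, and the batch size are all assumptions chosen for the example.

```python
import sqlite3
from itertools import islice

def load(conn, strings, batch_size=100_000):
    """Insert strings in batches, committing once per batch."""
    it = iter(strings)
    while True:
        batch = [(s,) for s in islice(it, batch_size)]
        if not batch:
            break
        with conn:  # commits when the block exits
            conn.executemany(
                "INSERT INTO strings (string) VALUES (?)", batch
            )

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY gives every inserted row an automatically assigned id.
conn.execute("CREATE TABLE strings (id INTEGER PRIMARY KEY, string TEXT)")

# A tiny batch size here just to exercise the periodic commit.
load(conn, ["aa", "bb", "aa", "cc", "aa"], batch_size=2)

# Build the index once, after all rows are in and before querying.
conn.execute("CREATE INDEX idx_strings_string ON strings (string)")

freq = dict(
    conn.execute("SELECT string, count(string) FROM strings GROUP BY string")
)
print(freq)
```

Building the index after the load avoids updating it on every insert, which is the reason for the ordering suggested above.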
The SQL:
Output:
$VAR1 = {
  '9876543210' => {
    'count(string)' => '9',
    'string' => '9876543210'
  },
  '0123456789' => {
    'count(string)' => '1',
    'string' => '0123456789'
  }
};
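A result of this shape (rows keyed by the string column) can be reproduced with a short sketch like the one below. It is written in Python rather than Perl and is not the code that produced the output above; the table name strings and the sample rows are assumptions picked to match the two strings shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be read by column name
conn.execute("CREATE TABLE strings (id INTEGER PRIMARY KEY, string TEXT)")

# Nine copies of one string and a single copy of another.
rows = [("9876543210",)] * 9 + [("0123456789",)]
with conn:
    conn.executemany("INSERT INTO strings (string) VALUES (?)", rows)

# The alias pins the column name instead of relying on SQLite's default.
query = (
    'SELECT string, count(string) AS "count(string)" '
    "FROM strings GROUP BY string"
)
by_string = {row["string"]: dict(row) for row in conn.execute(query)}
print(by_string)
```

Unlike the Perl dump, where every value is printed as a quoted string, the counts come back here as Python integers.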