I have a large table with 9 columns and 12 million rows, like this:
col1 col2 col3 col4 col5 col6 col7 col8 col9
12.3 37.4 7771 -675 -23 23.8 78.8 -892 67.5
79.3 -6.3 6061 -555 -24 28.1 77.1 -889 32.6
55.6 -7.3 8888 -921 -56 78.3 22.3 -443 22.9
.... .... .... .... .... .... .... .... ....
Currently the table is saved in TSV (tab-separated values) format on my hard disk, 432MB in size. I want to load the table into Redis so that I can run this kind of query as efficiently as possible: given a min value and a max value for each column, count the number of rows that fall within all of the given ranges, i.e.
(min_col1 <= col1 <= max_col1) &&
(min_col2 <= col2 <= max_col2) &&
(min_col3 <= col3 <= max_col3) &&
(min_col4 <= col4 <= max_col4) &&
(min_col5 <= col5 <= max_col5) &&
(min_col6 <= col6 <= max_col6) &&
(min_col7 <= col7 <= max_col7) &&
(min_col8 <= col8 <= max_col8) &&
(min_col9 <= col9 <= max_col9)
So my questions are:
1) How do I populate the table into Redis? What kind of key/value data structure should I use: hashes, lists, sets, sorted sets, or something else?
2) After populating the table, given 9 min and max values for the 9 columns, how do I write the query to get the count, i.e. the number of rows falling within all 9 ranges? One way I can think of is to first find the rows that satisfy (min_colX <= colX <= max_colX) for each X from 1 to 9, and then compute their intersection. But I guess this is not the most efficient way. I just want to retrieve the count as fast as possible.
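To make the intersection idea concrete, here is a rough in-memory Python sketch of it. This is not Redis code: the sorted per-column lists only stand in for whatever sorted structure Redis would provide, and the function names (build_column_indexes, rows_in_range, count_matches) are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def build_column_indexes(rows):
    """For each column, build a list of (value, row_id) pairs sorted by value."""
    n_cols = len(rows[0])
    return [
        sorted((row[c], row_id) for row_id, row in enumerate(rows))
        for c in range(n_cols)
    ]


def rows_in_range(index, lo, hi):
    """Return the ids of rows whose value v satisfies lo <= v <= hi."""
    # Row ids are >= 0, so (lo, -1) sorts before every pair with value lo,
    # and (hi, inf) sorts after every pair with value hi.
    start = bisect_left(index, (lo, -1))
    end = bisect_right(index, (hi, float("inf")))
    return {row_id for _, row_id in index[start:end]}


def count_matches(indexes, ranges):
    """Intersect the per-column candidate sets and count the survivors."""
    survivors = None
    for index, (lo, hi) in zip(indexes, ranges):
        hits = rows_in_range(index, lo, hi)
        survivors = hits if survivors is None else survivors & hits
        if not survivors:
            return 0  # some column already excludes every remaining row
    return len(survivors) if survivors is not None else 0
```

Each per-column lookup is a binary search, but the intersection still has to touch every candidate row id, which is why I suspect this does not scale well when the individual ranges are wide.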
By the way, I have tried MongoDB. It is straightforward to populate the table using mongoimport, but it takes 10 seconds to complete my query, which is too slow and not acceptable for my real-time application. In contrast, Redis holds data in memory, so I hope Redis can shorten the query time to 1 second.
For your reference, this is what I did in MongoDB.
mongoimport -u my_username -p my_password -d my_db -c my_coll --type tsv --file my_table.tsv --headerline
use my_db
db.my_coll.ensureIndex({col1:1, col2:1, col3:1, col4:1, col5:1, col6:1, col7:1, col8:1, col9:1 })
db.my_coll.count({ col1: {$gte: min_col1, $lte: max_col1}, col2: {$gte: min_col2, $lte: max_col2}, col3: {$gte: min_col3, $lte: max_col3}, col4: {$gte: min_col4, $lte: max_col4}, col5: {$gte: min_col5, $lte: max_col5}, col6: {$gte: min_col6, $lte: max_col6}, col7: {$gte: min_col7, $lte: max_col7}, col8: {$gte: min_col8, $lte: max_col8}, col9: {$gte: min_col9, $lte: max_col9} })
I used explain() to make sure the Btree index was actually used rather than a table scan.
I also tried creating a RAM disk and storing my MongoDB database on it. That shortened the query time from 10s to 9s, which is still far from acceptable for my real-time application.
mkdir ~/ram
chmod -R 755 ~/ram
mount -t tmpfs none ~/ram -o size=8192m
mongod --dbpath ~/ram --noprealloc --smallfiles
Make each col a sorted set, then use ZRANGEBYSCORE on each key, and do the intersection and count in the application. I use phpredis and I do that a lot in memory, using array_intersect.
The performance problem is in ZADD, which you will use to create the sorted sets. Once you have all the sorted sets created in Redis' memory, the rest is really fast.
Creating sorted sets (Redis sample)
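A minimal sketch of this step, written in Python rather than as raw Redis commands. It assumes a client object exposing redis-py's zadd(name, mapping) method, one sorted set per column with the hypothetical key names col1 to col9, the row number as the member, and the column value as the score. The helper names and the batch size are illustrative choices, not measured recommendations.

```python
import csv


def iter_tsv_rows(lines):
    """Yield (row_id, [float, ...]) for each data row, skipping the header."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader)  # header line: col1 ... col9
    for row_id, fields in enumerate(reader):
        yield row_id, [float(x) for x in fields]


def load_sorted_sets(client, rows, batch_size=10000):
    """Create one sorted set per column: member = row id, score = value."""
    batches = {}  # key name -> {member: score} not yet sent to Redis
    for row_id, values in rows:
        for c, value in enumerate(values, start=1):
            key = f"col{c}"  # assumed key naming, one key per column
            batch = batches.setdefault(key, {})
            batch[str(row_id)] = value
            if len(batch) >= batch_size:
                client.zadd(key, batch)  # one ZADD carrying many members
                batches[key] = {}
    # Flush whatever is left over for each column.
    for key, batch in batches.items():
        if batch:
            client.zadd(key, batch)
```

With redis-py installed this would be called along the lines of load_sorted_sets(redis.Redis(), iter_tsv_rows(open("my_table.tsv"))). Sending many members per ZADD cuts down on round trips, which should soften the ZADD cost mentioned above, although loading 12 million rows times 9 columns will still take a while.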
PHP, finding ranges, intersection and count
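The same steps as a minimal sketch in Python instead of PHP, with a set intersection playing the role of array_intersect. It assumes a client object exposing redis-py's zrangebyscore(name, min, max) method and sorted sets stored under the hypothetical keys col1 to col9, with row ids as members and column values as scores. The function name count_in_ranges is made up for illustration.

```python
def count_in_ranges(client, ranges):
    """Count rows that fall inside every (min, max) range.

    ranges: list of (min, max) tuples, one per column, in column order.
    """
    survivors = None
    for c, (lo, hi) in enumerate(ranges, start=1):
        # ZRANGEBYSCORE returns the members (row ids) with lo <= score <= hi.
        members = set(client.zrangebyscore(f"col{c}", lo, hi))
        # Keep only the row ids that matched every column so far.
        survivors = members if survivors is None else survivors & members
        if not survivors:
            return 0  # no row can match, skip the remaining columns
    return len(survivors) if survivors is not None else 0
```

The intersection and the count happen in the application, so every matching row id for each column travels over the connection; narrow ranges keep that transfer small.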
Hope that helps.