I have a rather complex database query which returns 30 million records, roughly 15 times the amount of data that would fit into memory. I need to access all records from the database sequentially (i.e. sorted). For performance reasons it is not possible to use an “order by” statement, as preparing the ordered ResultSet takes roughly 40 minutes.
I see two possible options to solve my problem:
- Dump the resulting data into an unordered file and use some form of merge sort to arrive at a sorted file.
- Flatten the data, dump it into a secondary database, and reselect it using the ordering mechanisms of that database.
Which would you prefer for reasons of elegance and performance?
If your choice is number two, do you have a suggestion for the database to use? Would you prefer SQLite, MySQL or Apache Derby?
For sorting large amounts of data, one solution is to split the data into blocks small enough to load into memory, e.g. a 30th of the total each (15 * 2, i.e. half of what would fit in memory), and sort the records of each block. This will give you 30 sorted files.
Take the 30 sorted files and do a merge sort between them. (This requires only the current record of each file in memory, i.e. at least 30 records.) You can process the records as they come out of the merge.
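The two phases above could be sketched roughly like this in Java. This is only a minimal sketch under a few assumptions of mine: the records have already been dumped to a text file with one record per line, plain string order is the sort order you want, and the names `ExternalSort`, `splitIntoSortedRuns` and `mergeRuns` are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: external merge sort over a text file with one record per line. */
final class ExternalSort {

    /** Phase 1: read blocks of at most maxLinesInMemory lines, sort each, write it to a temp file. */
    static List<Path> splitIntoSortedRuns(Path input, int maxLinesInMemory) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> block = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                block.add(line);
                if (block.size() >= maxLinesInMemory) {
                    runs.add(writeSortedRun(block));
                    block.clear();
                }
            }
            if (!block.isEmpty()) {
                runs.add(writeSortedRun(block)); // last, possibly smaller block
            }
        }
        return runs;
    }

    private static Path writeSortedRun(List<String> block) throws IOException {
        Collections.sort(block); // in-memory sort of one block
        Path run = Files.createTempFile("run-", ".txt");
        Files.write(run, block, StandardCharsets.UTF_8);
        return run;
    }

    /** One open sorted file plus the line currently at its head. */
    private static final class Cursor implements Comparable<Cursor> {
        final BufferedReader reader;
        String head;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Phase 2: k-way merge; only one line per sorted file is held in memory at a time. */
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
                readers.add(reader);
                Cursor cursor = new Cursor(reader);
                if (cursor.head != null) {
                    heap.add(cursor);
                }
            }
            while (!heap.isEmpty()) {
                Cursor smallest = heap.poll();   // file whose head line is smallest
                out.write(smallest.head);        // process or write the record here
                out.newLine();
                if (smallest.advance()) {
                    heap.add(smallest);          // re-insert with its next line
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}
```

You would call `splitIntoSortedRuns(dump, n)` with `n` chosen so that one block is about a 30th of the data, then `mergeRuns(runs, sortedFile)`. If you only need to process the records in order rather than store them, the `out.write` call is the place to hook in your processing. For real records you would probably compare on a parsed key instead of the whole line.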
BTW: It is also possible it's time to buy a more powerful computer. You can buy a PC with 16 GB of memory and an SSD for close to $1000. For $2000 you can get a fast PC with 32 GB of memory. This could save you a lot of time. 😉