I have a database table with about five possible index columns, all of which are useful in different ways. Let’s call them System, Source, Heat, Time, and Row. System and Row together form a unique key, and a table sorted by System-Row is also sorted by any combination of the five columns, as long as they are taken in the order I listed them above.
My problem is that I use all combinations of these columns: sometimes I want to JOIN each System-Row to the next System-(Row+1), sometimes I want to GROUP or WHERE by System-Source-Heat, sometimes I want to look at all entries of System-Source WHERE Time is in a specific window, etc.
Basically, I want an index structure that functions similarly to every possible permutation of those five indexes (in the correct order, of course), without actually making every permutation (although I am willing to do so if necessary). I’m doing statistics / analytics, not traditional database work, so the size of the index and the speed of creating / updating it are not a concern. I only care about speeding up my improvised queries, since I tend to think them up, run them, wait 5-10 minutes, and then never use them again. Thus my main concern is reducing the “wait 5-10 minutes” to something more like “wait 1-2 minutes.”
My sorted data would look something like this:
Sys  So  H  Ti  R
1    1   0  .1  1
1    1   1  .2  2
1    1   1  .3  3
1    1   2  .3  4
1    2   0  .5  5
1    2   0  .6  6
1    2   1  .8  7
1    2   2  .8  8
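To make the three query shapes mentioned above concrete, here is a minimal sketch that loads the sample rows and runs each one. It assumes SQLite through Python's sqlite3 module purely for illustration (the actual database engine is not stated), and the table name `readings` and the column name `RowNum` (standing in for Row) are hypothetical.

```python
import sqlite3

# Assumption: SQLite in memory; `readings` and `RowNum` are illustrative names.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE readings (
           System INTEGER, Source INTEGER, Heat INTEGER,
           Time REAL, RowNum INTEGER,
           PRIMARY KEY (System, RowNum))"""
)
sample = [
    (1, 1, 0, 0.1, 1), (1, 1, 1, 0.2, 2), (1, 1, 1, 0.3, 3), (1, 1, 2, 0.3, 4),
    (1, 2, 0, 0.5, 5), (1, 2, 0, 0.6, 6), (1, 2, 1, 0.8, 7), (1, 2, 2, 0.8, 8),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?, ?)", sample)

# 1) JOIN each System-Row to the next System-(Row+1).
pairs = conn.execute(
    """SELECT a.RowNum, b.RowNum, b.Time - a.Time
       FROM readings AS a
       JOIN readings AS b
         ON b.System = a.System AND b.RowNum = a.RowNum + 1
       ORDER BY a.RowNum"""
).fetchall()

# 2) GROUP BY System-Source-Heat.
groups = conn.execute(
    """SELECT System, Source, Heat, COUNT(*)
       FROM readings
       GROUP BY System, Source, Heat
       ORDER BY System, Source, Heat"""
).fetchall()

# 3) All entries of one System-Source WHERE Time falls in a window.
window = conn.execute(
    """SELECT RowNum FROM readings
       WHERE System = ? AND Source = ? AND Time BETWEEN ? AND ?
       ORDER BY RowNum""",
    (1, 2, 0.5, 0.7),
).fetchall()

print(len(pairs), groups, window)
```

Each query leans on a different leading subset of the five columns, which is why no single column order serves all three equally well.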
EDIT: It may simplify things a bit that System virtually always needs to be included as the first column for any of the other four columns to be in sorted order.
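As a rough illustration of what "System first" means for an index, the sketch below builds one composite index led by System and asks the engine which access path it picks for a time-window query. It assumes SQLite via Python's sqlite3 module; the table name `readings`, the column name `RowNum` (standing in for Row), and the index name `idx_sys_src_time` are hypothetical, and other engines report their plans differently.

```python
import sqlite3

# Assumption: SQLite in memory; all object names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings "
    "(System INTEGER, Source INTEGER, Heat INTEGER, Time REAL, RowNum INTEGER)"
)
sample = [
    (1, 1, 0, 0.1, 1), (1, 1, 1, 0.2, 2), (1, 1, 1, 0.3, 3), (1, 1, 2, 0.3, 4),
    (1, 2, 0, 0.5, 5), (1, 2, 0, 0.6, 6), (1, 2, 1, 0.8, 7), (1, 2, 2, 0.8, 8),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?, ?)", sample)

# Composite index with System as the leading column, then the columns
# used for equality (Source) and finally the range column (Time).
conn.execute("CREATE INDEX idx_sys_src_time ON readings (System, Source, Time)")

query = (
    "SELECT RowNum FROM readings "
    "WHERE System = ? AND Source = ? AND Time BETWEEN ? AND ? "
    "ORDER BY RowNum"
)
params = (1, 2, 0.5, 0.7)

# EXPLAIN QUERY PLAN returns rows whose last field describes the access path.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " | ".join(str(step[-1]) for step in plan)
result = conn.execute(query, params).fetchall()

print(plan_text)
print(result)
```

Putting the equality columns before the range column lets the engine narrow the search on all three, whereas a range column placed earlier in the index would stop the columns after it from being used for the lookup.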
Sorry for taking a while to get back to this; I had to work on something else for a few weeks. Anyway, after trying a bunch of things (including everything suggested here, even the brute-force “make an index for every permutation” method), I haven’t found any indexing method that significantly improves performance.
However, I HAVE found an alternate, non-indexing solution: selecting only the rows and columns I’m interested in into intermediary tables, and then working with those instead of the complete table (so I use about 5 mil rows of 6 cols instead of 30 mil rows of 35 cols). The initial select and table creation are a bit slow, but the steps after that are so much faster that I actually save time even if I only run them once (and considering how often I change things, it’s usually much more than once).
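The intermediary-table approach can be sketched roughly as follows. This again assumes SQLite via Python's sqlite3 module for illustration only; the table names `full_data` and `work`, and the columns `Value`, `Extra1`, and `Extra2` (stand-ins for the many columns of the real table), are hypothetical.

```python
import sqlite3

# Assumption: SQLite in memory; table and extra column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE full_data "
    "(System INTEGER, Source INTEGER, Heat INTEGER, Time REAL, RowNum INTEGER, "
    " Value REAL, Extra1 TEXT, Extra2 TEXT)"
)
sample = [
    (1, 1, 0, 0.1, 1), (1, 1, 1, 0.2, 2), (1, 1, 1, 0.3, 3), (1, 1, 2, 0.3, 4),
    (1, 2, 0, 0.5, 5), (1, 2, 0, 0.6, 6), (1, 2, 1, 0.8, 7), (1, 2, 2, 0.8, 8),
]
# Pad each sample row with a made-up measurement and two unused columns.
wide_rows = [row + (row[4] * 10.0, "x", "y") for row in sample]
conn.executemany("INSERT INTO full_data VALUES (?, ?, ?, ?, ?, ?, ?, ?)", wide_rows)

# Step 1 (slow, done once): keep only the rows and columns of interest.
conn.execute(
    """CREATE TEMP TABLE work AS
       SELECT System, Source, Heat, Time, RowNum, Value
       FROM full_data
       WHERE System = 1 AND Source = 2"""
)

# Step 2 (fast, repeated): run the improvised queries against the small table.
work_columns = [d[0] for d in conn.execute("SELECT * FROM work").description]
work_count = conn.execute("SELECT COUNT(*) FROM work").fetchone()[0]
avg_by_heat = conn.execute(
    "SELECT Heat, AVG(Value) FROM work GROUP BY Heat ORDER BY Heat"
).fetchall()

print(work_columns, work_count, avg_by_heat)
```

Every later query reads only the narrow, pre-filtered table, so far less data has to be scanned per query than with the full-width original.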
I have a suspicion that the reason for this vast improvement will be obvious to most SQL users (probably something about pagefile size), and I apologize if so. My only excuse is that I’m a statistician trying to teach myself how to do this as I go, and while I’m pretty decent at getting what I want done to happen (eventually), my understanding of the mechanics of how it’s being done is distressingly close to “it’s a magic black box, don’t worry about it.”