I’m working on constructing a database (SQLite) to store information about each run of a Mathematica script I’ve written. The script takes several input parameters, so my DB has a table with a column for each parameter (among other columns).
Some of the input parameters are lists of numbers. My first thought for storing these is to use a junction table as described in the accepted answer to this question. But I typically use the same list for several different runs. How can I look up whether any given list is already in the database, so I can reuse its ID rather than storing it again?
Constraints as mentioned in comments:
- There is no explicit upper bound on the length of a list but in practice it ranges from 1 to about 50.
- The number of distinct lists will be small, on the order of 10.
- I actually have 3 list parameters. For two of them, the values in the list are non-negative, double precision floating point numbers; for the third, the values are pairs of such numbers.
- There are no duplicate entries. (These are more precisely sets, so no duplicates and order is irrelevant)
- I can easily arrange for the list elements to be in sorted order.
For example: suppose my table is set up like this
CREATE TABLE jobs (id INTEGER PRIMARY KEY, param1 REAL, param2_id INTEGER);
CREATE TABLE param2 (param2_id INTEGER PRIMARY KEY, value REAL);
When I run the script, it sets the parameters and then calls a function to run the calculation, like so:
param1 = 4;
param2 = {.1, .3, .5};
runTheCalculation[param1, param2]
Assuming this is the very first run of the script, it will insert the following contents into the DB:
jobs: id param1 param2_id
1 4.0 1
param2: param2_id value
1 0.1
1 0.3
1 0.5
So far, so good. Now let’s say I run the script again with one different parameter,
param1 = 2;
param2 = {.1, .3, .5};
runTheCalculation[]
In a naive implementation, this will result in the database containing this:
jobs: id param1 param2_id
1 4.0 1
2 2.0 2
param2: param2_id value
1 0.1
1 0.3
1 0.5
2 0.1
2 0.3
2 0.5
But I would like it to be able to look up the fact that the list {.1, .3, .5} is already in the database, so that after the second run the DB contains this instead:
jobs: id param1 param2_id
1 4.0 1
2 2.0 1
param2: param2_id value
1 0.1
1 0.3
1 0.5
What sort of a query can I use to find that the list {.1, .3, .5} already exists in the table param2?
I’m not opposed to creating additional tables if necessary. Or if there is some model other than using a junction table that makes more sense, that’s fine too.
If the list is short, and the quantity of lists is relatively small, then you can simply sequence the lists in the
TBL_Listsand see if yours matches. This is pretty inefficient as it will enumerate all stored lists to compare to your one stored list.Another way, and the better way in my opinion, would be to hash the list and store its hash in a
TBL_List_HashesHashing the list will require enumerating it one time.
An example hashing algorithm might be to build a string of all the sorted numerical values, uniformally padded, then run any hashing method on the concatenated string.
It should be relatively easy to obtain a hash of a given list and then retrieve the matching hash from the DB.
Even with a relatively simple hash algorithm with collisions you will be able to significantly reduce the number of lists you need to validate in order to make the comparison.
So if your hash algorithm has collisions then you’re adding an enumeration (and query) expense for each erroneous match.
EDIT:
Here is a relevant answer for .net
.net 3.5 List<T> Equality and GetHashCode
EDIT2:
And if you are order-agnostic in your matching then simply standardize the list order before hashing
GetHashCode for a Class with a List Object