I have a Postgres 9.1.4 table with a few million rows. A very small version could be this:
master
index location
----------------
1 A
2 C
3 B
4 C
5 C
6 A
I need to do an expensive calculation on each distinct value in the location field. I don’t want to use this field from the master table because I would be repeatedly processing the same location. I want a table with distinct values of location, and the result of the calculation will be stored in a calculation field:
distinct
index location calculation
------------------------------
1 A' X
2 C' Y
3 B' Z
The relationship between master and distinct is not easily determined after distinct is populated. I have to do a little data manipulation to make the locations work in the calculation. I really need a third table, created roughly at the same time I populate distinct, to help me correlate each entry in distinct back to its parent in master.
The third table might look like this:
correlation
master_index distinct_index
------------------------------
1 1
2 3
3 2
4 3
5 3
6 1
The problem is I don’t see how this would be done with any straightforward SQL. I could use something like this as the beginning of a query to populate distinct:
SELECT location, array_agg(index)
FROM master
GROUP BY location;
The problem is I would need another array column in distinct to hold these values, then I’d later need to use some other program to parse the arrays and construct the correlation table.
Am I missing a simpler way of doing this?
You can create the “distinct” table as follows (although, I caution you to avoid using SQL key words as names of tables as columns):
Create the correlation table as:
These work, but you might want something more efficient. After creating the tables, you add indexes for efficiency. You can also do other tricks, such as precreating the distinct table with an auto incremented primary key, and using that as the distinct index. Then you would use insert to load the data into the query.