I am currently trying to process a large block of simulation data (about 2 GB). The data is in a table that looks like:
Table: Simulation Data
+-------+--------+----------+-------+
| id | run_id | timestep | value |
+-------+--------+----------+-------+
| 1 | 1 | 1 | 0.00 |
| 2 | 1 | 2 | 0.003 |
| : | : | : | : |
| 9543 | 1 | 9543 | 0.23 |
| 9544 | 2 | 1 | 0.00 |
| : | : | : | : |
+-------+--------+----------+-------+
So for each run (identified by a run_id) there are a number of time steps with corresponding data (in the case of run_id 1, there were 9543 time steps).
During a simulation run, events take place. The time steps of these events are recorded in another table:
Table: Simulation Events
+-------+--------+----------+
| id | run_id | timestep |
+-------+--------+----------+
| 1 | 1 | 152 |
| 2 | 1 | 193 |
| 3 | 1 | 382 |
| : | : | : |
| 143 | 1 | 9382 |
| 144 | 2 | 137 |
| : | : | : |
+-------+--------+----------+
So for this set of data, run_id 1 had events at time steps 152, 193, 382, … 9382; run_id 2 has its first event at time step 137, and so on. I am interested in what happens in the 3 time steps before each event, at the event's own time step, and in the 3 time steps after it, for each run_id. I would like to put together a query that returns something that looks like:
+--------+----------------+----------+-------+
| run_id | event_timestep | delta_ts | value |
+--------+----------------+----------+-------+
| 1 | 152 | -3 | 0.053 |
| 1 | 152 | -2 | 0.042 |
| 1 | 152 | -1 | 0.031 |
| 1 | 152 | 0 | 0.003 |
| 1 | 152 | 1 | 0.532 |
| 1 | 152 | 2 | 0.736 |
| 1 | 152 | 3 | 1.138 |
| 1 | 193 | -3 | 0.049 |
| : | : | : | : |
| 1 | 9382 | -3 | 0.068 |
| : | : | : | : |
| 1 | 9382 | 3 | 1.523 |
+--------+----------------+----------+-------+
Here the first row, with delta_ts = -3, would be the value from timestep 149, delta_ts = -2 would be from timestep 150, delta_ts = -1 from timestep 151, etc. Any thoughts on putting together a query that would do this?
There are two different approaches to this:
1. Join in the query itself: select ... from table t1, table t2 where ..., but you have to figure out a condition that links two rows if and only if they’re “related”. Also keep in mind that pairs are commutative in your example, so add a condition like t1.id < t2.id, which also excludes self-joins.
2. … nsteps, and correlate them manually. This is slower and uses more memory, but it’s easier to write.
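Here is a rough sketch of the join approach applied to the two tables in the question. The linking condition is "same run_id, and the data timestep is within 3 of the event timestep". The table names simulation_data and simulation_events, the index, and the toy rows are assumptions made for illustration; only the column names come from the question. It uses an in-memory SQLite database through Python so it can be run as-is. Since this joins two different tables, the t1.id < t2.id condition isn't needed here.

```python
import sqlite3

# In-memory database; table names are assumed, column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE simulation_data (
        id INTEGER PRIMARY KEY, run_id INTEGER, timestep INTEGER, value REAL);
    CREATE TABLE simulation_events (
        id INTEGER PRIMARY KEY, run_id INTEGER, timestep INTEGER);
    -- An index on (run_id, timestep) lets the range condition use a lookup.
    CREATE INDEX idx_data_run_ts ON simulation_data (run_id, timestep);
""")

# Toy data (made up): run 1 has 10 time steps, value = timestep / 100.
conn.executemany(
    "INSERT INTO simulation_data (run_id, timestep, value) VALUES (?, ?, ?)",
    [(1, ts, ts / 100.0) for ts in range(1, 11)],
)
# One made-up event at time step 5 of run 1.
conn.execute("INSERT INTO simulation_events (run_id, timestep) VALUES (1, 5)")

# Each event row is paired with every data row of the same run whose
# timestep lies within 3 steps of the event; delta_ts is the offset.
WINDOW_QUERY = """
    SELECT e.run_id,
           e.timestep              AS event_timestep,
           d.timestep - e.timestep AS delta_ts,
           d.value
    FROM simulation_events AS e
    JOIN simulation_data   AS d
      ON d.run_id = e.run_id
     AND d.timestep BETWEEN e.timestep - 3 AND e.timestep + 3
    ORDER BY e.run_id, e.timestep, delta_ts
"""

rows = conn.execute(WINDOW_QUERY).fetchall()
for row in rows:
    print(row)
```

Note that an event closer than 3 time steps to the start or end of a run will simply return fewer than 7 rows, because the missing time steps have no data rows to join to.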