Suppose I am storing events associated with users in a table as follows (with dt standing in for the timestamp of the event):
| dt | user | event |
| 1 | 1 | A |
| 2 | 1 | D |
| 3 | 1 | B |
| 4 | 1 | C |
| 5 | 1 | B |
| 6 | 2 | B |
| 7 | 2 | B |
| 8 | 2 | A |
| 9 | 2 | A |
| 10 | 2 | C |
Such that we could say:
- user 1 has an event-sequence of ADBCB
- user 2 has event-sequence BBAAC
The types of questions I would want to answer about these users are very easy to express as regular expresions on the event-sequences, e.g. “which users have an event-sequence matching A.*B?” or “which users have an event-sequence matching A[^C]*B[^C]*D?” etc.
What would be a good SQL technique or operator I could use to answer similar queries over this table structure?
Is there a way to efficiently/dynamically generate a table of user-to-event-sequence which could then be queried with regex?
I am currently looking at using Postgres, but I am curious to know if any of the bigger DBMS’s like SQLServer or Oracle have specialized operators for this as well.
With Postgres 9.x this is actually quite easy:
Using that result you can now apply a regular expression on the event_sequence:
With Postgres 8.x you need to find a replacement for the string_agg() function (just google for it, there are a lot of examples out there) and you need a sub-select to ensure the ordering of the aggregate as 8.x does support an
order byin an aggregate function.