I have a table that holds phone calls, with the following fields:
- ID
- STARTTIME
- ENDTIME
- STATUS
- CALL_FROM
- CALL_TO
There are 2,9 million records loaded into a local PostgreSQL database. I added indexes on ID (unique index), starttime and endtime.
Searching on stackoverflow, I found some useful SQL and modified it to what I think logically should work. The problem is that the query runs for many hours and never returns:
SELECT T1.sid, count(*) as CountSimultaneous
FROM calls_nov T1, calls_nov T2
WHERE
T1.StartTime between T2.StartTime and T2.EndTime
and T1.StartTime between '2011-11-02' and '2011-11-03'
GROUP BY
T1.sid
ORDER BY CountSimultaneous DESC;
Can someone please either suggest a way to fix the query/index so that it actually works or suggest another way to calculate concurrent calls?
EDIT:
Explain plan:
Sort (cost=11796758237.81..11796758679.47 rows=176663 width=35)
Sort Key: (count(*))
-> GroupAggregate (cost=0.00..11796738007.56 rows=176663 width=35)
-> Nested Loop (cost=0.00..11511290152.45 rows=57089217697 width=35)
Table creation script:
CREATE TABLE calls_nov (
sid varchar,
starttime timestamp,
endtime timestamp,
call_to varchar,
call_from varchar,
status varchar);
Index creation:
CREATE UNIQUE INDEX sid_unique_index on calls_nov (sid);
CREATE INDEX starttime_index on calls_nov (starttime);
CREATE INDEX endtime_index on calls_nov (endtime);
1.) Your query did not catch all overlaps – this was fixed by the other answers, already.
2.) The data type of your columns
starttimeandendtimeistimestamp. So yourWHEREclause is slightly wrong, too:This would include ‘2011-11-03 00:00’. The upper border has to be excluded.
3.) Removed the mixed case syntax without double-quotes. Unquoted identifiers are cast to lower case automatically. To put it simple: Best don’t use mixed case identifiers at all in PostgreSQL.
4.) Transformed the query to use explicit JOIN which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I want to count calls that overlap with no other calls, too.
5.) Simplified the syntax a bit to arrive at this base query:
This query is extremely slow for a big table, because every row starting on ‘2011-11-02’ has to be compared to every row in the whole table, which leads to (almost) O(n²) cost.
Faster
We can drastically cut down the cost by pre-selecting possible candidates. Only select columns and rows you need. I do this with two CTE.
xy)x. -> CTEyFaster yet
I have a real life table of 350.000 rows with overlapping start / end timestamps similar to yours. I used that for a quick benchmark. PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on
startandend. (Index on ID column is irrelevant here.) Tested withEXPLAIN ANALYZE, best of 5.Total runtime: 476994.774 ms
CTE variant:
Total runtime: 4199.788 ms — that’s > factor 100.
After adding a multicolumn index of the form:
Total runtime: 4159.367 ms
Ultimate Speed
If that is not enough, there is a way to speed it up yet another order of magnitude. Instead of the CTEs above, materialize the temp tables and – this is the crucial point – create an index on the second one. Could look like this:
Execute as one transaction:
Read about temporary tables in the manual.
Ultimate solution
Create a plpgsql function that encapsulates the magic.
Diagnose the typical size of your temp tables. Create them standalone and measure:
If they are bigger than your setting for temp_buffers then temporarily set them high enough in your function to hold both your temporary tables in RAM. It is a major speedup if you don’t have to swap to disc. (Must be first use of temp tables in session to have effect.)
Call:
Total runtime: 138.169 ms — that’s factor 3000
What else can you do to speed it up?
General performance optimization.