I have 2 tables in PostgreSQL 9.1 – flight_2012_09_12 containing approx 500,000 rows and

Asked: June 15, 2026 (2026-06-15T10:02:07+00:00)

I have 2 tables in PostgreSQL 9.1 – flight_2012_09_12 containing approx 500,000 rows and position_2012_09_12 containing about 5.5 million rows. I’m running a simple join query and it’s taking a long time to complete. Although the tables aren’t small, I’m convinced there are some major gains to be made in the execution.

The query is:

SELECT f.departure, f.arrival, 
       p.callsign, p.flightkey, p.time, p.lat, p.lon, p.altitude_ft, p.speed 
FROM position_2012_09_12 AS p 
JOIN flight_2012_09_12 AS f 
     ON p.flightkey = f.flightkey 
WHERE p.lon < 0 
      AND p.time BETWEEN '2012-9-12 0:0:0' AND '2012-9-12 23:0:0'

The output of explain analyze is:

Hash Join  (cost=239891.03..470396.82 rows=4790498 width=51) (actual time=29203.830..45777.193 rows=4403717 loops=1)
Hash Cond: (f.flightkey = p.flightkey)
->  Seq Scan on flight_2012_09_12 f  (cost=0.00..1934.31 rows=70631 width=12) (actual time=0.014..220.494 rows=70631 loops=1)
->  Hash  (cost=158415.97..158415.97 rows=3916885 width=43) (actual time=29201.012..29201.012 rows=3950815 loops=1)
     Buckets: 2048  Batches: 512 (originally 256)  Memory Usage: 1025kB
     ->  Seq Scan on position_2012_09_12 p  (cost=0.00..158415.97 rows=3916885 width=43) (actual time=0.006..14630.058 rows=3950815 loops=1)
           Filter: ((lon < 0::double precision) AND ("time" >= '2012-09-12 00:00:00'::timestamp without time zone) AND ("time" <= '2012-09-12 23:00:00'::timestamp without time zone))
Total runtime: 58522.767 ms

I think the problem lies with the sequential scan on the position table but I can’t figure out why it’s there. The table structures with indexes are below:

               Table "public.flight_2012_09_12"
Column             |            Type             | Modifiers 
-------------------+-----------------------------+-----------
callsign           | character varying(8)        | 
flightkey          | integer                     | 
source             | character varying(16)       | 
departure          | character varying(4)        | 
arrival            | character varying(4)        | 
original_etd       | timestamp without time zone | 
original_eta       | timestamp without time zone | 
enroute            | boolean                     | 
etd                | timestamp without time zone | 
eta                | timestamp without time zone | 
equipment          | character varying(6)        | 
diverted           | timestamp without time zone | 
time               | timestamp without time zone | 
lat                | double precision            | 
lon                | double precision            | 
altitude           | character varying(7)        | 
altitude_ft        | integer                     | 
speed              | character varying(4)        | 
asdi_acid          | character varying(4)        | 
enroute_eta        | timestamp without time zone | 
enroute_eta_source | character varying(1)        | 
Indexes:
"flight_2012_09_12_flightkey_idx" btree (flightkey)
"idx_2012_09_12_altitude_ft" btree (altitude_ft)
"idx_2012_09_12_arrival" btree (arrival)
"idx_2012_09_12_callsign" btree (callsign)
"idx_2012_09_12_departure" btree (departure)
"idx_2012_09_12_diverted" btree (diverted)
"idx_2012_09_12_enroute_eta" btree (enroute_eta)
"idx_2012_09_12_equipment" btree (equipment)
"idx_2012_09_12_etd" btree (etd)
"idx_2012_09_12_lat" btree (lat)
"idx_2012_09_12_lon" btree (lon)
"idx_2012_09_12_original_eta" btree (original_eta)
"idx_2012_09_12_original_etd" btree (original_etd)
"idx_2012_09_12_speed" btree (speed)
"idx_2012_09_12_time" btree ("time")

          Table "public.position_2012_09_12"
   Column    |            Type             | Modifiers 
-------------+-----------------------------+-----------
 callsign    | character varying(8)        | 
 flightkey   | integer                     | 
 time        | timestamp without time zone | 
 lat         | double precision            | 
 lon         | double precision            | 
 altitude    | character varying(7)        | 
 altitude_ft | integer                     | 
 course      | integer                     | 
 speed       | character varying(4)        | 
 trackerkey  | integer                     | 
 the_geom    | geometry                    | 
Indexes:
"index_2012_09_12_altitude_ft" btree (altitude_ft)
"index_2012_09_12_callsign" btree (callsign)
"index_2012_09_12_course" btree (course)
"index_2012_09_12_flightkey" btree (flightkey)
"index_2012_09_12_speed" btree (speed)
"index_2012_09_12_time" btree ("time")
"position_2012_09_12_flightkey_idx" btree (flightkey)
"test_index" btree (lon)
"test_index_lat" btree (lat)

I can’t think of any other way to rewrite the query, so I’m stumped at this point. If the current setup is as good as it gets, so be it, but it seems to me that it should be much faster than it currently is. Any help would be much appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-06-15T10:02:08+00:00

The reason you are getting a sequential scan is that Postgres believes it will read fewer disk pages that way than by using indexes. It is probably right. Consider: if you use a non-covering index, you need to read all the matching index pages, which essentially yields a list of row identifiers. The DB engine then needs to read each of the matching data pages.
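
One way to check the planner's judgement, sketched here rather than prescribed, is to discourage sequential scans for the current session and compare the resulting plan and timing with the original. `enable_seqscan` is a standard planner setting; turning it off does not forbid sequential scans outright, it only makes them look very expensive to the planner.

```sql
-- Session-local experiment only; do not change this setting globally.
SET enable_seqscan = off;

-- Same query as in the question, so the two plans are directly comparable.
EXPLAIN ANALYZE
SELECT f.departure, f.arrival,
       p.callsign, p.flightkey, p.time, p.lat, p.lon, p.altitude_ft, p.speed
FROM position_2012_09_12 AS p
JOIN flight_2012_09_12 AS f
     ON p.flightkey = f.flightkey
WHERE p.lon < 0
      AND p.time BETWEEN '2012-9-12 0:0:0' AND '2012-9-12 23:0:0';

-- Restore the default planner behaviour.
RESET enable_seqscan;
```

If the index-based plan turns out slower, that supports the planner's original choice.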

Your position table uses 71 bytes per row, plus whatever a geom type takes (I’ll assume 16 bytes for illustration), making 87 bytes. A Postgres page is 8192 bytes. So you have approximately 90 rows per page.

Your query matches 3950815 out of 5563070 rows, or about 70% of the total. Assuming the data is randomly distributed with regard to your WHERE filters, the chance of finding a data page with no matching row is roughly 0.3^90: each of the 90 rows on the page has a 30% chance of not matching, and all of them would have to miss. This is essentially nothing. So regardless of how good your indexes are, you’re still going to have to read all the data pages. If you’re going to have to read all the pages anyway, a table scan is usually a good approach.
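
The arithmetic above can be reproduced in psql. This is a rough sketch that takes the 87-byte row size (itself partly an assumption) at face value and ignores page and tuple header overhead, so the figures are only indicative. The last query reads the planner's own statistics from `pg_class`, which are approximate and only as fresh as the last ANALYZE or VACUUM.

```sql
-- Rows per 8192-byte page, using the 87-byte row size assumed above
-- (integer division; per-page and per-tuple overhead is ignored).
SELECT 8192 / 87 AS est_rows_per_page;

-- Chance that all 90 rows on a page fail the filter,
-- if each row independently fails with probability 0.3.
SELECT power(0.3::double precision, 90) AS p_page_has_no_match;

-- Compare the estimate with what the catalog reports for the real table.
SELECT relpages,
       reltuples,
       reltuples / NULLIF(relpages, 0) AS actual_rows_per_page
FROM pg_class
WHERE relname = 'position_2012_09_12';
```

The probability comes out vanishingly small (on the order of 10^-47), which is the point: almost every page holds at least one matching row.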

The one get-out here is that I said non-covering index. If you are prepared to create indexes that can answer queries in and of themselves, you can avoid looking up the data pages at all, so you are back in the game. I’d suggest the following are worth looking at:

flight_2012_09_12 (flightkey, departure, arrival)
position_2012_09_12 (flightkey, time, lon, ...)
position_2012_09_12 (lon, time, flightkey, ...)
position_2012_09_12 (time, lon, flightkey, ...)

The dots here represent the rest of the columns you are selecting. You’ll only need one of the indexes on position, but it’s hard to tell which will prove the best. The first position index (leading with flightkey) may permit a merge join on presorted data, at the cost of reading the whole index to do the filtering. The second and third will allow data to be prefiltered, but require a hash join. Given how much of the cost appears to be in the hash join, the merge join might well be a good option.
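
As a sketch, the suggested indexes could be created like this. The index names are hypothetical, and the "..." is expanded to the remaining columns the query selects from position. One caveat I am fairly sure of: index-only scans were introduced in PostgreSQL 9.2, so on 9.1 the planner would still visit the table for each matching row, and the benefit described here should be measured after an upgrade rather than assumed.

```sql
-- Covering index for the flight side of the join (name is hypothetical).
CREATE INDEX flight_2012_09_12_cover_idx
    ON flight_2012_09_12 (flightkey, departure, arrival);

-- First position variant: leads with the join key, which may suit a merge join.
-- The trailing columns are the rest of the columns the query selects.
CREATE INDEX position_2012_09_12_cover_idx
    ON position_2012_09_12 (flightkey, "time", lon, callsign, lat, altitude_ft, speed);

-- The other two variants would instead lead with (lon, "time", flightkey, ...)
-- or ("time", lon, flightkey, ...); only one position index is needed.

-- Refresh planner statistics before re-running EXPLAIN ANALYZE.
ANALYZE flight_2012_09_12;
ANALYZE position_2012_09_12;
```

Building each variant in turn and comparing the EXPLAIN ANALYZE output is probably the only reliable way to pick between them.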

As your query requires 52 of the 87 bytes per row, and indexes have overheads, you may not end up with the index taking much, if any, less space than the table itself.

Another approach is to attack the “randomly distributed” side of it by looking at clustering.
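
A minimal sketch of clustering, assuming the existing `test_index` on lon is the one to order by. That choice is my assumption: the time range covers almost the whole day, so lon looks like the more selective filter, but clustering on the time index is just as easy to try. With about 70% of rows matching, the gain may be modest either way.

```sql
-- Physically rewrites the table in index order.
-- CLUSTER holds an ACCESS EXCLUSIVE lock on the table while it runs.
CLUSTER position_2012_09_12 USING test_index;

-- Refresh statistics so the planner sees the new physical ordering.
ANALYZE position_2012_09_12;
```

CLUSTER is a one-time operation: rows inserted or updated afterwards are not kept in index order, so it would need repeating if the table changes.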
