I would like to insert a record into a table and, if the record is already present, get its id; otherwise run the insert and get the new record’s id.
I will be inserting millions of records and have no idea how to do this in an efficient manner. What I am doing now is to run a select to check if the record is already present, and if not, insert it and get the inserted record’s id. As the table is growing I imagine that SELECT is going to kill me.
What I am doing now in Python with psycopg2 looks like this:
    select = ("SELECT id FROM ... WHERE ...", [...])
    cur.execute(*select)
    if not cur.rowcount:
        insert = ("INSERT INTO ... VALUES ... RETURNING id", [...])
        cur.execute(*insert)
    rid = cur.fetchone()[0]
Is it maybe possible to do something in a stored procedure like this:
    BEGIN
        EXECUTE sql_insert;
        RETURN id;
    EXCEPTION WHEN unique_violation THEN
        -- return id of already existing record
        -- from the exception info ?
    END;
Any ideas on how to optimize a case like this?
First off, this is obviously not an UPSERT as UPDATE was never mentioned. Similar concurrency issues apply, though.

There will always be a race condition for this kind of task, but you can minimize it to an extremely tiny time slot, while at the same time querying for the ID only once with a data-modifying CTE (introduced with PostgreSQL 9.1):
Given a table tbl:

Use this query:
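A minimal sketch of a table and query matching this description, wrapped in Python for psycopg2 as in the question. The exact statements are an assumption reconstructed from the names used in this answer (tbl, tbl_id, some_col, and the CTEs x, y, z); the constant names and the function get_or_create_id are hypothetical.

```python
# Assumed table definition: a serial primary key plus a unique column.
CREATE_TABLE_SQL = """
CREATE TABLE tbl (
    tbl_id   serial PRIMARY KEY,
    some_col text UNIQUE
);
"""

# x: the value is entered once.
# y: looks up an existing tbl_id (NULL if there is none).
# z: inserts the row only if y found nothing.
# The final SELECT returns whichever id exists.
# %s is psycopg2's parameter placeholder.
GET_OR_INSERT_SQL = """
WITH x AS (SELECT %s::text AS some_col)
, y AS (
    SELECT x.some_col,
           (SELECT t.tbl_id FROM tbl t WHERE t.some_col = x.some_col) AS tbl_id
    FROM   x
)
, z AS (
    INSERT INTO tbl (some_col)
    SELECT y.some_col
    FROM   y
    WHERE  y.tbl_id IS NULL
    RETURNING tbl_id
)
SELECT COALESCE(
    (SELECT tbl_id FROM z),
    (SELECT tbl_id FROM y)
);
"""


def get_or_create_id(cur, value):
    """Run the single-statement lookup-or-insert and return the row's id.

    `cur` is assumed to be a DB-API cursor (e.g. from psycopg2) connected
    to PostgreSQL 9.1 or later, where data-modifying CTEs are available.
    """
    cur.execute(GET_OR_INSERT_SQL, [value])
    return cur.fetchone()[0]
```

With an open psycopg2 cursor, `rid = get_or_create_id(cur, 'foo')` would replace the separate SELECT and INSERT round trips from the question with a single statement.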
- x is only for convenience: enter values once.
- y retrieves tbl_id, if it already exists.
- z inserts the new row, if it doesn’t.
- The final SELECT avoids running another query on the table with the COALESCE construct.

Now, this can still fail if a concurrent transaction commits a new row with some_col = 'foo' exactly between CTE y and z, but that’s extremely unlikely. If it happens you get a duplicate key violation and have to retry. Nothing lost. If you don’t face concurrent writes, you can just forget about this.

You can put this into a plpgsql function and rerun the query on duplicate key error automatically.
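The same retry can also be done on the client side instead of in plpgsql. The following is a sketch under several assumptions: the connection is a psycopg2-style connection that is not in autocommit mode (so SAVEPOINT is valid), the transaction uses the default READ COMMITTED isolation level (so the retried statement sees the concurrently committed row), and the raised exception exposes the SQLSTATE as a pgcode attribute, as psycopg2 errors do. The function name, savepoint name, and max_attempts parameter are hypothetical.

```python
UNIQUE_VIOLATION = "23505"  # SQLSTATE PostgreSQL reports for unique_violation


def get_or_create_id_with_retry(conn, sql, value, max_attempts=3):
    """Run a lookup-or-insert statement, retrying on duplicate key errors.

    `sql` is a statement taking one parameter and returning one id,
    such as the data-modifying CTE query described above.
    """
    cur = conn.cursor()
    try:
        for _ in range(max_attempts):
            # A savepoint lets us undo only the failed statement,
            # not the whole surrounding transaction.
            cur.execute("SAVEPOINT get_or_create")
            try:
                cur.execute(sql, [value])
                rid = cur.fetchone()[0]
            except Exception as exc:
                # Anything other than a duplicate key error is re-raised.
                if getattr(exc, "pgcode", None) != UNIQUE_VIOLATION:
                    raise
                cur.execute("ROLLBACK TO SAVEPOINT get_or_create")
                cur.execute("RELEASE SAVEPOINT get_or_create")
                continue  # a concurrent insert won the race; try again
            cur.execute("RELEASE SAVEPOINT get_or_create")
            return rid
    finally:
        cur.close()
    raise RuntimeError("gave up after %d duplicate key errors" % max_attempts)
```

On the second attempt the lookup part of the statement should find the row that the concurrent transaction committed, so more than one retry is rarely needed.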
Goes without saying that you need two indexes in this setup (like displayed in my CREATE TABLE statement above):

- UNIQUE or PRIMARY KEY constraint on tbl_id (which is of serial type!)
- UNIQUE or PRIMARY KEY constraint on some_col

Both implement an index automatically.