I wrote this Python script to import a specific .xls file into MySQL. It works fine, but if it’s run twice on the same data it creates duplicate entries. I’m pretty sure I need to use a MySQL JOIN, but I’m not clear on how to do that. Also, is executemany() going to have the same overhead as doing inserts in a loop? I’m obviously trying to avoid that.
Here’s the code in question…
for row in range(sheet.nrows):
    # name is in column 0, email is in column 4
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name.lstrip()] = email.strip()
for n, e in sorted(mailing_list.items()):
    rows.append((n, e))
db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany("""
    INSERT IGNORE INTO mailing_list (name, email) VALUES (%s,%s)""", rows)
CLARIFICATION…
I read here that…
To be sure, executemany() is effectively the same as simple iteration.
However, it is typically faster. It provides an optimized means of
affecting INSERT and REPLACE across multiple rows.
Also, I took Unode’s suggestion and used the UNIQUE constraint. But the IGNORE keyword is better than ON DUPLICATE KEY UPDATE for my case because I want it to fail silently.
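For illustration, here is a minimal sketch of how a UNIQUE constraint plus an ignoring insert makes the import safe to run twice. It uses Python's built-in sqlite3 module purely as a stand-in so it runs without a MySQL server; SQLite spells the statement INSERT OR IGNORE, and the MySQL equivalents are given in the comments. The choice of email as the unique column, the sample rows, and the import_rows helper are assumptions made for the example, not details from the actual table.

```python
import sqlite3

# sqlite3 is only a stand-in so this sketch runs without a server.
# Assumed MySQL equivalents (the unique column is a guess, not from the post):
#   ALTER TABLE mailing_list ADD UNIQUE (email);
#   INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mailing_list (name TEXT, email TEXT UNIQUE)")

# Made-up sample data standing in for the rows read from the spreadsheet.
rows = [("Ann", "ann@example.com"), ("Bob", "bob@example.com")]


def import_rows(conn, rows):
    """Hypothetical helper: insert rows, silently skipping duplicates."""
    # INSERT OR IGNORE is SQLite's counterpart to MySQL's INSERT IGNORE.
    conn.executemany(
        "INSERT OR IGNORE INTO mailing_list (name, email) VALUES (?, ?)", rows
    )
    conn.commit()


import_rows(db, rows)
import_rows(db, rows)  # second run on the same data adds nothing

count = db.execute("SELECT COUNT(*) FROM mailing_list").fetchone()[0]
print(count)
```

One thing to keep in mind with MySQL: INSERT IGNORE downgrades some other errors to warnings as well, not only duplicate-key errors, so "fail silently" covers more than duplicates.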
TL;DR
1. What’s the best way to prevent duplicate inserts?
ANSWER 1: A UNIQUE constraint on the column, with INSERT IGNORE to fail silently or ON DUPLICATE KEY UPDATE to update the existing row instead of inserting a duplicate.
2. Is executemany() as expensive as INSERT in a loop?
@Unode says it’s not but my research tells me otherwise. I would like a definitive answer.
3. Is this the best way or is it going to be really slow with bigger tables, and how would I test to be sure?
1 – What’s the best way to prevent duplicate inserts?
Depending on what “preventing” means in your case, you have two strategies and one requirement.
The requirement is that you add a UNIQUE constraint on the column/columns that you want to be unique. This alone will cause an error if insertion of a duplicate entry is attempted. However, given you are using executemany, the outcome may not be what you would expect.
Then as strategies you can do:
- An initial filter step by running a SELECT statement before. This means running one SELECT statement per item in your rows to check if it exists already. This strategy works but is inefficient.
- Using ON DUPLICATE KEY UPDATE. This automatically triggers an update if the data already exists. For more information refer to the official documentation.
2 – Is executemany() as expensive as INSERT in a loop?
No. For an INSERT ... VALUES statement like yours, executemany creates one query which inserts in bulk, while a for loop creates as many queries as there are elements in your rows.
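To make "one query which inserts in bulk" concrete, the sketch below builds a single multi-row INSERT from a list of rows. It only illustrates the shape of the statement; it is not MySQLdb's actual implementation, and build_bulk_insert and the sample rows are hypothetical names and data introduced for the example.

```python
def build_bulk_insert(table, columns, rows):
    """Hypothetical helper: one multi-row INSERT plus a flat parameter list."""
    # One "(%s, %s)" group per row, all joined into a single VALUES clause.
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT IGNORE INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(rows)),
    )
    # Flatten the row tuples so the parameters line up with the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


# Made-up sample data standing in for the rows read from the spreadsheet.
rows = [("Ann", "ann@example.com"), ("Bob", "bob@example.com")]
sql, params = build_bulk_insert("mailing_list", ("name", "email"), rows)
print(sql)
print(params)
```

With a loop, each row would be sent as its own statement and pay its own round trip to the server; here both rows travel in one statement.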