Let’s assume that I have a table with two columns A and B. There is an index on column A but not on column B.
I want to issue several million queries like:
UPDATE t1 SET b=b1 WHERE a=a1;
UPDATE t1 SET b=b2 WHERE a=a2;
....
There are anywhere from 1 to 100,000 rows corresponding to each unique value of a. On average it’s around 100.
For each update statement on average 60% of the rows will not be changed because for these rows b already has the desired value. For 30% of the updates none of the matching rows will be changed.
Does it make sense to use statements like these?
UPDATE t1 SET b=b1 WHERE a=a1 AND b<>b1;
Will it speedup the process by eliminating unneeded writebacks to disk or is Mysql 5 smart enough to recognize that nothing is being changed and there is no need to write back to disk?
In either case, MySQL will have to read the row content (whether it is on disk or in a cache/bufferpool). In either case, MySQL will use your index on
aas a starting point. In either case, MySQL will not update the row if it already has the destination value forb. Consequently, I don’t see any way in which MySQL would benefit from having theb<>b1clause.Arguably, depending on the workload and dataset, the query (with
b<>b1in it) might benefit if you change your index onato be a compound index onaandb(in that order). In this case, it wouldn’t have to hit disk (or check the cache/buffer pool) to find which rows specifically require updating (i.e. you’d capitalize on the 30% and 60% you mentioned). Having said that, now your index will require an update for each update onb, so there is a cost, although I suspect the trade-off might be worth it.