The schema
I have a MySQL database with one large table (5 million rows say). This table has several fields for actual data, an optional comment field, and fields to record when the row was first added and when the data is deleted. To simplify to one “data” column, it looks a bit like this:
+----+------+---------+---------+----------+
| id | data | comment | created | deleted |
+----+------+---------+---------+----------+
| 1 | val1 | NULL | 1 | 2 |
| 2 | val2 | nice | 1 | NULL |
| 3 | val3 | NULL | 2 | NULL |
| 4 | val4 | NULL | 2 | 3 |
| 5 | val5 | NULL | 3 | NULL |
This schema allows us to look at any past version of the data thanks to the created and deleted fields e.g.
SET @version=1;
SELECT data, comment FROM MyTable
WHERE created <= @version AND
(deleted IS NULL OR deleted > @version);
+------+---------+
| data | comment |
+------+---------+
| val1 | NULL |
| val2 | nice |
The current version of the data can be fetched more simply:
SELECT data, comment FROM MyTable WHERE deleted IS NULL;
+------+---------+
| data | comment |
+------+---------+
| val2 | nice |
| val3 | NULL |
| val5 | NULL |
DDL:
CREATE TABLE `MyTable` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`data` varchar(32) NOT NULL,
`comment` varchar(32) DEFAULT NULL,
`created` int(11) NOT NULL,
`deleted` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `data` (`data`,`comment`)
) ENGINE=InnoDB;
Updating
Periodically a new set of data and comments arrives. This can be fairly large, half a million rows say. I need to update MyTable so that this new data set is stored in it. This means:
- “Deleting” old rows. Note the “scare quotes” – we don’t actually delete rows from
MyTable. We have to set thedeletedfield to the new versionN. This has to be done for all rows inMyTablethat are in the previous versionN-1, but are not in the new set. - Inserting new rows. All rows that are in the new set and are not in version
N-1inMyTablemust be added as new rows with thecreatedfield set to the new versionN, anddeletedas NULL.
Some rows in the new set may match existing rows in MyTable at version N-1 in which case there is nothing to do.
My current solution
Given that we have to “diff” two sets of data to work out the deletions, we can’t just read over the new data and do insertions as appropriate. I can’t think of a way to do the diff operation without dumping all the new data into a temporary table first. So my strategy goes like this:
-- temp table uses MyISAM for speed.
CREATE TEMPORARY TABLE tempUpdate (
`data` char(32) NOT NULL,
`comment` char(32) DEFAULT NULL,
PRIMARY KEY (`data`),
KEY (`data`, `comment`)
) ENGINE=MyISAM;
-- Bulk insert thousands of rows
INSERT INTO tempUpdate VALUES
('some new', NULL),
('other', 'comment'),
...
-- Start transaction for the update
BEGIN;
SET @newVersion = 5; -- Worked out out-of-band
-- Do the "deletions". The join selects all non-deleted rows in MyTable for
-- which the matching row in tempUpdate does not exist (tempUpdate.data is NULL)
UPDATE MyTable
LEFT JOIN tempUpdate
ON MyTable.data = tempUpdate.data AND
MyTable.comment <=> tempUpdate.comment
SET MyTable.deleted = @newVersion
WHERE tempUpdate.data IS NULL AND
MyTable.deleted IS NULL;
-- Delete all rows from the tempUpdate table that match rows in the current
-- version (deleted is null) to leave just new rows.
DELETE tempUpdate.*
FROM MyTable RIGHT JOIN tempUpdate
ON MyTable.data = tempUpdate.data AND
MyTable.comment <=> tempUpdate.comment
WHERE MyTable.id IS NOT NULL AND
MyTable.deleted IS NULL;
-- All rows left in tempUpdate are new so add them.
INSERT INTO MyTable (data, comment, created)
SELECT DISTINCT tempUpdate.data, tempUpdate.comment, @newVersion
FROM tempUpdate;
COMMIT;
DROP TEMPORARY TABLE IF EXISTS tempUpdate;
The question (at last)
I need to find the fastest way to do this update operation. I can’t change the schema for MyTable, so any solution must work with that constraint. Can you think of a faster way to do the update operation, or suggest speed-ups to my existing method?
I have a Python script for testing the timings of different update strategies and checking their correctness over several versions. It’s fairly long but I can edit into the question if people think it would be useful.
One of speed-ups is for loading — LOAD DATA INFILE.