The table contains about 40,000,000 records having:
CREATE TABLE `event` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`some_other_id_not_fk` int(10) unsigned default NULL,
`event_time` datetime NOT NULL,
`radius` float default NULL,
`how_heavy` smallint(6) default NULL,
PRIMARY KEY (`id`),
KEY `event_some_other_id_not_fk` (`some_other_id_not_fk`),
KEY `event_event_time` (`event_time`)
) ENGINE=MyISAM AUTO_INCREMENT=6506226 DEFAULT CHARSET=utf8
You should know that the some_other_id_not_fk column is not big: it contains only 7 distinct numbers. The real pain is the event_time datetime column, as it contains an extremely large number of different datetimes, and basically everything is allowed: duplicates as well as unpredictably large time intervals without records to ‘cover’ them. You should also know that the (some_other_id_not_fk, event_time) pair must be allowed to have duplicates as well 🙁 I know this causes even more problems 🙁
I’ve had some experience in optimizing MySQL tables, but such a huge pain has never appeared on my horizon :/
The current state of ‘the things’ is:
- The selects by event_time between date1 and date2 (which I need to do) are satisfactorily fast. 🙂
- My inserts are slow, I mean really SLOW!!! More than 30 seconds, and even worse: LOAD DATA procedures that temporarily DISABLE and ENABLE KEYS are EXTREMELY slow (several hours), mainly on the ENABLE KEYS operation.
- The size of the index on disk is 7 times bigger than the size of the data.
I would have tried several different combinations of re-indexing by now, but the size of the data really prevents me from experimenting with dropping and creating indexes and columns at will.
Please help. Has anyone managed this? Would using timestamp instead of datetime solve my problem? Or maybe I should add additional columns for day, year, etc. and index on them?
Do you really need a BIGINT for id? You can probably get away with an INT. If you were to insert 1 row per second, 24 hours a day, it would take about 136 years to exhaust all values in an unsigned 32-bit integer.
This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.
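The figures above can be checked with a little arithmetic, which MySQL itself can do. This is only a rough sketch: it assumes exactly 40 million rows and counts just the 4 bytes per row saved by going from an 8-byte BIGINT to a 4-byte INT. Real index savings also depend on how MyISAM packs keys, so they will differ somewhat.

```sql
-- Years needed to use up every unsigned 32-bit id at 1 insert per second.
SELECT 4294967295 / (60 * 60 * 24 * 365.25) AS years_to_exhaust_int;

-- Bytes saved per row (8-byte BIGINT -> 4-byte INT) times 40 million rows,
-- expressed in MB (1 MB = 1024 * 1024 bytes here).
SELECT 40000000 * (8 - 4) / 1024 / 1024 AS data_mb_saved;
```

The first query comes out at roughly 136 years and the second at roughly 152.6 MB, in line with the numbers quoted above.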
You state that some_other_id_not_fk has only 7 distinct values. Does it need to be an INT type then? Could you use TINYINT instead? This will drastically reduce index size.
This will decrease the size of your table by 114.4 MB for 40 million rows, and will decrease the size of the some_other_id_not_fk index by approximately the same.
Do you need a DATETIME? A DATETIME takes 8 bytes; a TIMESTAMP takes 4 bytes. If you can use a TIMESTAMP then this will drastically reduce data and index size. Be aware of the limitations of TIMESTAMP fields though, such as the year 2038 problem (Y2K38) and how they behave with respect to timezones and replication.
This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your event_time index by 158.8 MB for 40 million rows.
These three changes will significantly reduce the size of your data as well as the indices.
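As a sketch of how the three column changes might be applied, the statements below first check that the existing data fits the smaller types and then make all changes in a single ALTER TABLE, so the table and its indexes are rebuilt once instead of three times. Several things here are assumptions, not facts from the question: that the 7 values of some_other_id_not_fk fit in 0..255, that every event_time falls inside the TIMESTAMP range, and the placeholder default on event_time, which is only there because, depending on MySQL version and settings, a TIMESTAMP column without an explicit constant default may be given automatic "current time" behavior.

```sql
-- Sanity checks first: the new types must be able to hold the existing values.
SELECT MAX(id)                   AS max_id,     -- must be <= 4294967295 for INT UNSIGNED
       MIN(some_other_id_not_fk) AS min_other,
       MAX(some_other_id_not_fk) AS max_other,  -- must be within 0..255 for TINYINT UNSIGNED
       MIN(event_time)           AS min_time,   -- must not be earlier than 1970 (UTC)
       MAX(event_time)           AS max_time    -- must not be later than early 2038 (UTC)
FROM event;

-- One ALTER for all three columns. Keep whatever nullability each column has today.
-- The DEFAULT on event_time is a hypothetical placeholder (see the note above).
ALTER TABLE event
  MODIFY id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  MODIFY some_other_id_not_fk TINYINT UNSIGNED DEFAULT NULL,
  MODIFY event_time TIMESTAMP NOT NULL DEFAULT '2000-01-01 00:00:00';
```

On a table this size the ALTER will still take a long time, so it is worth trying on a copy of the table first.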
Total Space Savings
Total: 852MB
As others have suggested, you may not even need all the indices that you have defined. With such a low selectivity on some_other_id_not_fk there’s a good chance the query optimizer won’t even use that index and will instead opt for a full table scan. Dropping this index completely would result in a significant space savings for your indices.
If you could provide some sample queries, I can help you further.
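One way to check whether the index on some_other_id_not_fk is actually being used before dropping it is sketched below. The SELECT is a hypothetical stand-in for one of your real queries, and the value 3 is made up.

```sql
-- Cardinality per index (an estimate); a value around 7 on 40 million rows
-- means each key value matches millions of rows.
SHOW INDEX FROM event;

-- Hypothetical query: if the "key" column in the output is NULL and
-- "type" is ALL, the optimizer chose a full table scan over the index.
EXPLAIN SELECT * FROM event WHERE some_other_id_not_fk = 3;

-- If the index turns out not to be used, it can be dropped.
ALTER TABLE event DROP INDEX event_some_other_id_not_fk;
```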
Also, are you inserting into this table under a heavy read load? Keep in mind that SELECTs in MyISAM will block an INSERT.
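If you are not sure whether lock contention is part of the problem, a few built-in status commands give a rough picture. This is only a diagnostic sketch; how to interpret the numbers depends on your workload.

```sql
-- Table_locks_waited growing quickly relative to Table_locks_immediate
-- suggests statements are queuing behind table-level locks.
SHOW STATUS LIKE 'Table_locks%';

-- Run while an INSERT is slow: look for statements waiting on a table lock
-- behind a long-running SELECT.
SHOW FULL PROCESSLIST;

-- Shows whether MyISAM concurrent inserts are enabled on this server.
SHOW VARIABLES LIKE 'concurrent_insert';
```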
Update
Most people are suggesting moving your some_other_id_not_fk field into the event_time index, so the new index would be on (event_time, some_other_id_not_fk). I will recommend the same, but with an important caveat.
This index will be good for queries where you are filtering only on event_time, or if you filter on both event_time and some_other_id_not_fk. It will not be used for queries filtering only on some_other_id_not_fk; a full table scan will occur.
Moreover, if your queries are always filtering on both event_time and some_other_id_not_fk, then do not use the index order of (event_time, some_other_id_not_fk). Rather, you should use the index (some_other_id_not_fk, event_time) instead.
Having the least selective (most duplicates) field first will allow for much greater compression for your index and thus a significantly reduced footprint on disk.
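A minimal sketch of the second variant, for the case where every query filters on both columns. The index name event_other_id_time, the id value 3 and the date range are made up for illustration. Dropping the two single-column indexes is an assumption too: keep event_event_time if any query filters on event_time alone, because the composite index below cannot serve it.

```sql
-- Replace the two single-column indexes (if they still exist) with one
-- composite index, in a single ALTER so the indexes are rebuilt once.
ALTER TABLE event
  DROP INDEX event_some_other_id_not_fk,
  DROP INDEX event_event_time,
  ADD INDEX event_other_id_time (some_other_id_not_fk, event_time);

-- Hypothetical query filtering on both columns; the "key" column of the
-- EXPLAIN output should name the new composite index.
EXPLAIN
SELECT *
FROM event
WHERE some_other_id_not_fk = 3
  AND event_time BETWEEN '2009-01-01 00:00:00' AND '2009-01-31 23:59:59';
```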