The problem:
I have multiple parallel processes that handle flat file records. Each file corresponds to a given interface in a telecommunications system (a message passing through the system is given a 32-digit globally unique identifier and there can be records for a given message on multiple interfaces). There is one process handling each file.
Let’s call the interfaces A, B and C. The message string can differ depending on which interface it was written by. I am supposed to create a table that stores information about each message passing through the system. So, this table should contain (among other fields):
id, message_on_A, message_on_B, message_on_C. I’d like to avoid duplicate entries for the same id.
What I have tried is the following:
1. Setting id as the PRIMARY KEY and using INSERT ... ON DUPLICATE KEY UPDATE commands so that each process sets its own message field.
2. Breaking id down into multiple parts and using these parts as a compound primary key; the rest is the same as approach 1.
3. Storing all records, then using a second query to extract all the information for each id (using GROUP BY id with max(message_on_A), max(message_on_B), max(message_on_C)). No primary key is defined for this approach.
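Roughly, approaches 1 and 3 amount to something like the following sketch. The table names `messages` and `raw_records`, the column types and the sample values are assumptions made for illustration only.

```sql
-- Hypothetical target table for approach 1 (column types are assumed).
CREATE TABLE messages (
    id           CHAR(32)     NOT NULL,
    message_on_A VARCHAR(255) NULL,
    message_on_B VARCHAR(255) NULL,
    message_on_C VARCHAR(255) NULL,
    PRIMARY KEY (id)
);

-- Approach 1, as issued by the process reading interface A:
-- insert a new row, or fill in message_on_A if the id already exists.
INSERT INTO messages (id, message_on_A)
VALUES ('00000000000000000000000000000001', 'payload seen on A')
ON DUPLICATE KEY UPDATE message_on_A = VALUES(message_on_A);

-- Hypothetical staging table for approach 3: same columns, no primary key,
-- one row per (message, interface) with the other two columns left NULL.
CREATE TABLE raw_records (
    id           CHAR(32)     NOT NULL,
    message_on_A VARCHAR(255) NULL,
    message_on_B VARCHAR(255) NULL,
    message_on_C VARCHAR(255) NULL
);

-- Approach 3: collapse the staged rows to one row per id.
-- MAX() ignores NULLs, so it picks the single non-NULL value per column.
SELECT id,
       MAX(message_on_A) AS message_on_A,
       MAX(message_on_B) AS message_on_B,
       MAX(message_on_C) AS message_on_C
FROM raw_records
GROUP BY id;
```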
None of these approaches has been fast enough. I’m looking for a solution that can achieve a run time of about 30 seconds for 1 million ids (so 3 million records, considering 3 interfaces).
The first and second approaches each did the job in about 400 seconds on MyISAM tables. I have also tried InnoDB, but it was much slower.
At the moment I’m considering giving approach 3 another shot, but I need to find a much faster query (the GROUP BY and max() query ran for over 20 minutes before I terminated it).
The question:
Can anybody suggest a better schema for this problem? And a better query?
I am thinking of a modification of the third approach. Store the data in three separate tables, one per interface, with the GUID as the primary key in each table. This should make insertions happen as fast as possible. Handle duplicates at this level.
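As a minimal sketch of that layout, assuming hypothetical table names `messages_a`, `messages_b` and `messages_c` and guessed column types, it could look like this. Using `INSERT IGNORE` to drop a repeated GUID is just one possible way to handle duplicates at this level, not necessarily the right policy for your data.

```sql
-- One narrow table per interface; names and types are assumptions.
CREATE TABLE messages_a (
    id      CHAR(32)     NOT NULL,
    message VARCHAR(255) NOT NULL,
    PRIMARY KEY (id)
);

-- Same structure (including the primary key) for the other two interfaces.
CREATE TABLE messages_b LIKE messages_a;
CREATE TABLE messages_c LIKE messages_a;

-- Each loader process writes only to its own table, so the three
-- processes never update the same row. INSERT IGNORE silently skips
-- a row whose id is already present in that table.
INSERT IGNORE INTO messages_a (id, message)
VALUES ('00000000000000000000000000000001', 'payload seen on A');
```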
Instead of group by, try the following:
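A minimal sketch of such a query, assuming three hypothetical per-interface tables `messages_a`, `messages_b` and `messages_c`, each with a primary key `id` and a `message` column (these names are illustrative, not from your schema):

```sql
-- Start from interface A and look up the matching rows on B and C
-- by primary key. LEFT JOIN keeps ids that have no B or C record,
-- leaving those columns NULL.
SELECT a.id,
       a.message AS message_on_A,
       b.message AS message_on_B,
       c.message AS message_on_C
FROM messages_a AS a
LEFT JOIN messages_b AS b ON b.id = a.id
LEFT JOIN messages_c AS c ON c.id = a.id;
```

Because the query is driven from `messages_a`, an id that appears only on B or C does not show up in the result.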
If this works, then your only problem is when messages are missing the A component. I think there is a way to fix that as well. The question is whether this achieves your performance goals.