The problem : I have multiple parallel processes that handle flat file records .


Asked: June 10, 2026


The problem:

I have multiple parallel processes that handle flat file records. Each file corresponds to a given interface in a telecommunications system (a message passing through the system is given a 32-digit globally unique identifier and there can be records for a given message on multiple interfaces). There is one process handling each file.

Let’s call the interfaces A, B and C. The message string can differ depending on which interface wrote it. I am supposed to create a table that stores information about each message passing through the system. So, this table should contain (among other fields):
id, message_on_A, message_on_B, message_on_C. I’d like to avoid duplicate entries for the same id.

What I have tried is the following:

1. Setting id as PRIMARY KEY and using INSERT ... ON DUPLICATE KEY UPDATE commands to set the corresponding message field for each process.
2. Breaking down id into multiple parts and using these parts as a compound primary key; the rest is the same as 1.
3. Storing all records, then using a second query to extract all the information for each id (using GROUP BY id, and max(message_on_A), max(message_on_B), max(message_on_C)). There is no primary key defined for this approach.
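For reference, the first approach might look roughly like the sketch below. The table name messages, the column types and the sample values are assumptions made for illustration; only the column names come from the description above.

```sql
-- Hypothetical single-table schema for approach 1.
-- CHAR(32) is assumed because the identifier is described as 32 digits;
-- VARCHAR(255) is an assumed message type.
CREATE TABLE messages (
    id           CHAR(32)     NOT NULL,
    message_on_A VARCHAR(255) NULL,
    message_on_B VARCHAR(255) NULL,
    message_on_C VARCHAR(255) NULL,
    PRIMARY KEY (id)
);

-- What the process reading interface A would run for each record.
-- If the id already exists (written by the B or C process),
-- only the message_on_A column is filled in.
INSERT INTO messages (id, message_on_A)
VALUES ('01234567890123456789012345678901', 'message text as seen on A')
ON DUPLICATE KEY UPDATE message_on_A = VALUES(message_on_A);
```

The processes for B and C would use the same statement with message_on_B or message_on_C in place of message_on_A.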

None of these approaches have been fast enough. I’m looking for a solution that can achieve a run-time of about 30 seconds for 1 million ids (so 3 million records considering 3 interfaces).

The first and second approach did the job in about 400 seconds on MyISAM tables. I have also tried on InnoDB but it was much slower.

At the moment I’m considering giving approach 3 another shot, but I need to find a much faster query (the GROUP BY and max() query ran for over 20 minutes before I terminated it).

The question:
Can anybody suggest a better schema for this problem? And a better query?


1 Answer

Editorial Team · Answer 1 · 2026-06-10T08:18:22+00:00

I am thinking of a modification of the third approach. Store the data in three separate tables, one per interface, with the GUID as the primary key in each table. This should make insertions happen as fast as possible, since each process writes only to its own table. Handle duplicates at this level.
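A minimal sketch of such a layout, assuming the tables are named A, B and C with columns id and message (the names the query below relies on). The column types, the MyISAM engine choice and the use of INSERT IGNORE to drop duplicates are assumptions, not something I have measured:

```sql
-- One table per interface, keyed by the 32-digit GUID.
-- Types are assumed; MyISAM is chosen only because the question
-- reports it was faster than InnoDB for this workload.
CREATE TABLE A (
    id      CHAR(32)     NOT NULL,
    message VARCHAR(255) NULL,
    PRIMARY KEY (id)
) ENGINE = MyISAM;

-- Same structure for the other two interfaces.
CREATE TABLE B LIKE A;
CREATE TABLE C LIKE A;

-- One possible way to handle duplicates at insert time:
-- INSERT IGNORE skips a row whose id is already present
-- instead of raising a duplicate-key error.
INSERT IGNORE INTO A (id, message)
VALUES ('01234567890123456789012345678901', 'message text as seen on A');
```

If the later record for an id should win instead of the first one, REPLACE or INSERT ... ON DUPLICATE KEY UPDATE could be swapped in for INSERT IGNORE.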

Instead of group by, try the following:

select A.id,
       A.message as A_message,
       (select B.message from B where B.id = A.id limit 1) as B_message,
       (select C.message from C where C.id = A.id limit 1) as C_message
from A

If this works, then your only problem is when messages are missing the A component. I think there is a way to fix that as well. The question is whether this achieves your performance goals.
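One possible fix for messages that never appear on A, offered as an untested sketch rather than a measured solution: build the full list of ids from all three tables first, then look up each message with a LEFT JOIN. The derived-table alias ids is a name chosen here for illustration.

```sql
-- Collect every id seen on any interface (UNION removes duplicates),
-- then attach whichever messages exist. A column is NULL when the
-- message was never recorded on that interface.
SELECT ids.id,
       A.message AS A_message,
       B.message AS B_message,
       C.message AS C_message
FROM (SELECT id FROM A
      UNION
      SELECT id FROM B
      UNION
      SELECT id FROM C) AS ids
LEFT JOIN A ON A.id = ids.id
LEFT JOIN B ON B.id = ids.id
LEFT JOIN C ON C.id = ids.id;
```

Each join is a primary-key lookup, so this should avoid the GROUP BY, but the UNION has to deduplicate up to 3 million ids, so whether it meets the 30-second target would need to be tested on real data.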
