I have the following MySQL instances, along with a replication setup:
S1 --> (M1 <--> M2), where:
M1 <--> M2 is a multi-master replication setup,
S1 is a slave that replicates the writes made on master M1.
Now, I’m trying to enhance the setup with a channel failover mechanism, where S1 would start replicating from M2, should M1 go down. Currently, the only way of doing this that I see is:
(M1 failure detection mechanism on S1 machine), then:
-> S1 gets the latest timestamp of M1’s queries from the local relay log file.
-> On M2, a bash script using the mysqlbinlog utility searches for the local binlog file and binlog position that correspond to S1’s latest timestamp
-> S1 can finally run “STOP SLAVE”, “CHANGE MASTER TO master_host=M2… master_log_file=… master_log_pos=…”, etc. to continue replication, this time from M2
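The last step above could be sketched as follows. This is only a minimal illustration in Python: the function name build_repoint_statements, the host name, and the binlog coordinates are hypothetical, and the helper only builds the statement strings (it does not connect to any server). The statements themselves (STOP SLAVE, CHANGE MASTER TO, START SLAVE) are the standard MySQL ones named in the step.

```python
def build_repoint_statements(new_master_host, log_file, log_pos):
    """Return the SQL statements S1 would run to switch to a new master.

    All three arguments are supplied by the caller; nothing here discovers
    the coordinates, it only formats them.
    """
    # Refuse values that could break out of the quoted SQL literals.
    for value in (new_master_host, log_file):
        if "'" in value or "\\" in value:
            raise ValueError(f"unsafe character in {value!r}")

    log_pos = int(log_pos)
    if log_pos < 0:
        raise ValueError("log position must be non-negative")

    return [
        "STOP SLAVE",
        "CHANGE MASTER TO "
        f"MASTER_HOST='{new_master_host}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={log_pos}",
        "START SLAVE",
    ]


if __name__ == "__main__":
    # Hypothetical host and coordinates, for illustration only.
    for stmt in build_repoint_statements("m2.example.com", "mysql-bin.000042", 107):
        print(stmt + ";")
```

Keeping the statement building separate from the execution makes it easy to log or review exactly what would be run on S1 before the switch happens.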
Is there a better (and less error prone) way of doing this?
Thank you
EDIT: Nowadays, this is much easier to achieve thanks to the unique Xid binlog query tags commonly used by the publicly accessible MySQL clustering solutions.
There is a simpler way to retrieve the binlog file and position needed.
Would it make more sense to just use the current binlog file and position as M2 knows them? You need to check the slave status on M2.
Example
For this display, there are five crucial components:
Relay_Log_Space
Please note Relay_Log_Space. Once this number stops incrementing, every SQL statement imported from the Master has been read. Unfortunately, it is possible that the last relay log may be corrupt or simply incomplete because of a sudden failover.
Replication Coordinates
Please also note the Replication Coordinates (Relay_Master_Log_File, Exec_Master_Log_Pos). This is the position you are hunting for. However, like Relay_Log_Space, it may still be incrementing. In fact, those Replication Coordinates should be equal to the other Replication Coordinates (Master_Log_File, Read_Master_Log_Pos). That’s when you know everything is caught up. If the two pairs of Replication Coordinates never meet, then you should rely a little more on when Relay_Log_Space stops incrementing.
What about Seconds_Behind_Master?
The reason you cannot use Seconds_Behind_Master is simple. Once a Master goes down hard, all it takes is for one replication thread (Slave_IO_Running or Slave_SQL_Running) to become No, and Seconds_Behind_Master turns NULL.
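The checks described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the status argument is taken to be one row of slave status output already fetched into a dict keyed by column name (how it is fetched is left out), and the function names and the required_repeats threshold are made up for this example.

```python
def executed_coordinates(status):
    """Position the SQL thread has executed up to, in the master's binlog."""
    return status["Relay_Master_Log_File"], int(status["Exec_Master_Log_Pos"])


def read_coordinates(status):
    """Position the I/O thread has read up to, in the master's binlog."""
    return status["Master_Log_File"], int(status["Read_Master_Log_Pos"])


def is_caught_up(status):
    """True when everything that was read from the master has been executed."""
    return executed_coordinates(status) == read_coordinates(status)


def relay_log_space_settled(samples, required_repeats=3):
    """True if the last `required_repeats` Relay_Log_Space samples are equal.

    `samples` is a list of values collected by polling the slave status;
    the threshold of 3 is an arbitrary choice for this sketch.
    """
    if len(samples) < required_repeats:
        return False
    tail = [int(s) for s in samples[-required_repeats:]]
    return len(set(tail)) == 1


if __name__ == "__main__":
    # Hypothetical status row taken after the master went down.
    status = {
        "Master_Log_File": "mysql-bin.000042",
        "Read_Master_Log_Pos": "2048",
        "Relay_Master_Log_File": "mysql-bin.000042",
        "Exec_Master_Log_Pos": "2048",
        "Slave_IO_Running": "No",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": None,  # NULL, so it says nothing about lag
    }
    print(is_caught_up(status))
    print(relay_log_space_settled([9000, 9512, 9512, 9512]))
```

Note that the sketch never looks at Seconds_Behind_Master: in the sample row it is already NULL because one thread has stopped, while the two pairs of coordinates still show that the slave has executed everything it read.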