In what cases can a queue manager lose its connectivity to the repository in a cluster environment?
I have an environment where a queue manager often loses its connectivity to the repository, and I need to refresh the cluster to fix this and re-establish communication with the other queue managers in the cluster.
Our cluster has 100 queue managers and we have 2 repositories in it.
There are a few different issues that can cause this. One is if there are explicitly defined CLUSSDR channels pointing to a non-repository QMgr. This causes repository messages to arrive at the non-repository QMgr, which can cause its amqrrmfa repository process to die. Another is that there have been a few APARs (such as this one) which can lead to that process dying. The solutions, respectively, are to fix the configuration issues or to apply the latest Fix Pack. Another issue, less commonly seen, is that a message to a new QMgr will error out before the new QMgr can resolve to the local QMgr. In this case, the REFRESH doesn't actually cause the remote QMgr to resolve, it just provides time for the resolution to complete.

Debugging this involves isolating the possible causes. Check that
amqrrmfa is running. Check that all non-repository QMgrs have one and ONLY one explicitly defined CLUSSDR channel. Verify that all repositories have one and ONLY one explicitly defined CLUSSDR to each other repository. If overlapping clusters are used, make sure NOT to overlap the channels. This means avoiding channel names like TO.QMGR and preferring names like CLUSTER.QMGR. Verify this by ensuring channels do not use the CLUSNL attribute and use the CLUSTER attribute instead. Finally, reconcile the objects in both repositories and the non-repository by issuing DIS CLUSQMGR(*) and DIS QCLUSTER(*). The repositories should have identical object inventories. If they differ, then there's the problem. The non-repository should have an entry for every QMgr it has previously talked to.

One thing I have seen in the past was that an administrator had scheduled a
REFRESH CLUSTER. His thinking was that this was something they needed to do to fix the cluster, so why not run it on a regular basis? So he scheduled it to run daily. Then each night it made the QMgr forget about the other QMgrs in the cluster, and the first time an app resolved a remote QMgr each day there was a flurry of repository traffic. This caused enough of a delay that there were a few 2087 (MQRC_UNKNOWN_REMOTE_Q_MGR) errors each morning. Not that you would do such a thing. 🙂
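
With 100 queue managers, comparing the DIS CLUSQMGR(*) output from the two repositories by eye gets tedious, so here is a minimal Python sketch of the reconciliation step from the debugging checklist. It assumes you have saved the runmqsc output to text and that it has the usual KEYWORD(value) layout. The sample output, the queue manager and channel names, and the helper names (parse_records, reconcile, manual_senders) are all made up for illustration. The DEFTYPE check assumes CLUSSDR and CLUSSDRB are the values that indicate an explicitly defined sender, so confirm that against the documentation for your MQ level.

```python
import re

# Hypothetical samples in the general shape of `DIS CLUSQMGR(*) DEFTYPE` output.
REPO1_OUTPUT = """
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM01)                          CHANNEL(DEMO.QM01)
   CLUSTER(DEMO)                           DEFTYPE(CLUSRCVR)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM02)                          CHANNEL(DEMO.QM02)
   CLUSTER(DEMO)                           DEFTYPE(CLUSSDRB)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM03)                          CHANNEL(DEMO.QM03)
   CLUSTER(DEMO)                           DEFTYPE(CLUSSDRA)
"""

REPO2_OUTPUT = """
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM01)                          CHANNEL(DEMO.QM01)
   CLUSTER(DEMO)                           DEFTYPE(CLUSSDRB)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM02)                          CHANNEL(DEMO.QM02)
   CLUSTER(DEMO)                           DEFTYPE(CLUSRCVR)
"""

NONREPO_OUTPUT = """
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM01)                          CHANNEL(DEMO.QM01)
   CLUSTER(DEMO)                           DEFTYPE(CLUSSDRB)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM02)                          CHANNEL(DEMO.QM02)
   CLUSTER(DEMO)                           DEFTYPE(CLUSSDRA)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM50)                          CHANNEL(DEMO.QM50)
   CLUSTER(DEMO)                           DEFTYPE(CLUSRCVR)
"""

# Matches KEYWORD(value). It does not handle nested parentheses,
# such as a CONNAME value that contains a port in brackets.
ATTR = re.compile(r"(\w+)\(([^)]*)\)")


def parse_records(text):
    """Split output on the AMQ message header lines; return one dict per record."""
    records = []
    for chunk in re.split(r"(?m)^AMQ\d+[A-Z]?:.*$", text):
        attrs = {key: value.strip() for key, value in ATTR.findall(chunk)}
        if "CLUSQMGR" in attrs:
            records.append(attrs)
    return records


def inventory(text):
    """Set of (cluster, queue manager) pairs known to one queue manager."""
    return {(r.get("CLUSTER", ""), r["CLUSQMGR"]) for r in parse_records(text)}


def reconcile(first, second):
    """Return (entries only in first, entries only in second), both sorted."""
    inv_first, inv_second = inventory(first), inventory(second)
    return sorted(inv_first - inv_second), sorted(inv_second - inv_first)


def manual_senders(text):
    """Channels whose DEFTYPE suggests an explicitly defined CLUSSDR (assumed values)."""
    return sorted(
        r["CHANNEL"]
        for r in parse_records(text)
        if r.get("DEFTYPE") in ("CLUSSDR", "CLUSSDRB")
    )


if __name__ == "__main__":
    only_in_repo1, only_in_repo2 = reconcile(REPO1_OUTPUT, REPO2_OUTPUT)
    print("Only in repository 1:", only_in_repo1)
    print("Only in repository 2:", only_in_repo2)
    print("Manual CLUSSDRs on non-repository:", manual_senders(NONREPO_OUTPUT))
```

Anything reported as present in only one repository is where I would start looking. On a non-repository QMgr, manual_senders should come back with exactly one channel. The same comparison works for DIS QCLUSTER(*) output if you key on the queue name instead of CLUSQMGR.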