Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
0.28
None
Description
See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638
Description of problem:
Qpid HA becomes unresponsive once any of the brokers is stopped with SIGSTOP.
There are three different cases:
a] stopped ALL brokers
b] stopped the primary
c] stopped a backup
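The three cases can be reproduced by suspending and resuming broker processes by PID. The following is a minimal sketch, not the procedure used in the original report: the helper names `freeze`, `thaw` and `exists` are hypothetical, and how the broker PIDs are obtained is left to the reader. It also illustrates one reason a stopped broker is hard to detect: a check that only asks whether the process exists still succeeds while the process is suspended.

```python
import os
import signal


def freeze(pid):
    """Hypothetical helper: suspend a process with SIGSTOP (cannot be caught or ignored)."""
    os.kill(pid, signal.SIGSTOP)


def thaw(pid):
    """Hypothetical helper: resume a process previously suspended with SIGSTOP."""
    os.kill(pid, signal.SIGCONT)


def exists(pid):
    """Existence-only check: signal 0 performs error checking but delivers no signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but is owned by another user
    return True
```

Calling `freeze(pid)` on the primary corresponds to case b]; `exists(pid)` keeps returning `True` afterwards, because a stopped process is still present in the process table.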
In each of the cases listed above, the following observations were made:
a-c] RHCS clustat looks fine and reports that everything is OK
a-c] qpid-ha (status --all) hangs
a,b,c*] all other clients are blocked indefinitely
  a-b] blocked directly from the beginning
  c*] blocked at the end; the client is able to recover after a minute or so,
  due to connection timeout
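Because qpid-ha (status --all) hangs rather than failing, a monitoring script needs its own deadline to turn the hang into a detectable result. The following is a minimal sketch under that assumption; `probe` is a hypothetical helper and the 10-second deadline in the usage note is an arbitrary illustrative value.

```python
import subprocess


def probe(cmd, timeout_s):
    """Hypothetical helper: run cmd and classify the outcome.

    Returns 'ok' on exit status 0, 'failed' on any other exit status,
    and 'hung' if the command does not finish within timeout_s seconds.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when the deadline expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

For example, `probe(["qpid-ha", "status", "--all"], 10)` would be expected to return `"hung"` in the situation described above, instead of blocking the caller indefinitely.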
This defect also shows that qpid-ha can be out of sync with clustat, as tracked by BZ.
The expectations are:
- a] quorum lost, HA down (same as kill -9 to all nodes)
  no clients able to communicate
- b] promotion of a new primary; there has to be a mechanism to get rid of the stopped process
  clients should be able to communicate after recovery
- c] the unresponsive backup should get restarted
  clients should be able to communicate once the backup has been detected as unresponsive
- Generally, better integration between the Qpid HA environment and RHCS is needed
  aka SIGSTOP detection
- A heartbeat between primary and backups is probably needed
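The heartbeat idea in the last item can be sketched as a simple timeout tracker. This is an illustration only, not the fix that was applied to Qpid: the class `HeartbeatMonitor`, its methods, and the peer names are all hypothetical, and the transport that delivers the heartbeats is left out.

```python
import time


class HeartbeatMonitor:
    """Hypothetical tracker: flags peers whose last heartbeat is older than timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.last_seen = {}     # peer name -> time of the most recent heartbeat

    def beat(self, peer):
        """Record that a heartbeat from peer has just arrived."""
        self.last_seen[peer] = self.clock()

    def unresponsive(self):
        """Return the sorted names of peers silent for longer than timeout_s."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A broker that has been sent SIGSTOP cannot send heartbeats, so it would show up in `unresponsive()` after `timeout_s` seconds even though its process and connections still exist. What to do next (restart the backup, promote a new primary) would remain a decision for the cluster manager.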