[QPID-5719] HA becomes unresponsive once any of the brokers are SIGSTOPed - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.28
Fix Version/s: 0.29
Component/s: C++ Clustering
Labels: None

Description

See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638

Description of problem:

Qpid HA becomes unresponsive once any of the brokers is stopped with SIGSTOP.

There are three different cases:
a] ALL brokers stopped
b] the primary stopped
c] a backup stopped

In each of the cases listed above, the following observations were made:

a-c] RHCS clustat reports that everything is fine
a-c] qpid-ha (status --all) hangs
a,b,c*] all other clients are blocked indefinitely
a-b] blocked from the very beginning
c*] blocked at the end; the client is able to recover after a minute or so,
due to the connection timeout
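
The difference between a stopped and a killed process can be illustrated with a small, self-contained sketch. It is not Qpid code: a tiny echo server stands in for a broker, and the names `SERVER_SRC`, `probe` and `demo` are made up for this illustration. A SIGSTOPed process keeps its listening socket open, so the kernel still completes new TCP connections but nothing ever answers them, whereas a killed process has its socket closed and new connections are refused at once.

```python
import os
import signal
import socket
import subprocess
import sys

# Tiny echo server standing in for a broker (illustrative only, not Qpid).
SERVER_SRC = """
import socket
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(5)
print(srv.getsockname()[1], flush=True)
while True:
    conn, _ = srv.accept()
    conn.sendall(conn.recv(64))
    conn.close()
"""


def probe(port, timeout):
    """Try one echo round trip; return 'ok', 'closed', 'hang' or 'refused'."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
            s.sendall(b"ping")
            return "ok" if s.recv(64) == b"ping" else "closed"
    except socket.timeout:
        return "hang"      # connected, but no reply within the timeout
    except ConnectionRefusedError:
        return "refused"   # nothing is listening any more


def demo(timeout=1.0):
    """Probe the server while it is running, SIGSTOPed, and killed."""
    child = subprocess.Popen([sys.executable, "-c", SERVER_SRC],
                             stdout=subprocess.PIPE, text=True)
    try:
        port = int(child.stdout.readline())
        running = probe(port, timeout)
        os.kill(child.pid, signal.SIGSTOP)
        os.waitpid(child.pid, os.WUNTRACED)  # block until the child is stopped
        stopped = probe(port, timeout)
    finally:
        child.kill()       # SIGKILL also terminates a stopped process
        child.wait()
        child.stdout.close()
    killed = probe(port, timeout)
    return running, stopped, killed
```

On a typical Linux host `demo()` would be expected to return `('ok', 'hang', 'refused')`: only the killed server produces an immediate, detectable failure, which is consistent with clients blocking until their own timeout fires.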

In fact, this defect also shows that qpid-ha can be out of sync with clustat, as tracked in Bugzilla.

The expectations are:

a] quorum is lost and HA goes down (same as kill -9 on all nodes);
no clients are able to communicate
b] a new primary is promoted; there has to be a mechanism to get rid of the stopped process;
clients should be able to communicate after recovery
c] the unresponsive backup should be restarted;
clients should be able to communicate once the backup has been detected as unresponsive
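
One way an external monitor could detect an unresponsive broker, instead of hanging along with it, is to bound the status query with a timeout. The following is a minimal sketch only: the helper name `command_responds` is hypothetical, and treating a zero exit status as "healthy" is an assumption, not documented qpid-ha behaviour.

```python
import subprocess


def command_responds(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds.

    A timeout is treated as 'unresponsive'; subprocess.run kills the child
    process itself before raising TimeoutExpired.
    """
    try:
        result = subprocess.run(cmd,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    # Assumption: a zero exit status means the query succeeded.
    return result.returncode == 0
```

A monitor might call `command_responds(["qpid-ha", "status", "--all"], timeout_s=10)` and trigger recovery when it returns False; the 10 second value is an arbitrary placeholder.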

In general, better integration between the Qpid HA environment and RHCS is needed,
i.e. SIGSTOP detection.
A heartbeat between the primary and the backups is probably needed.
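
The heartbeat idea can be sketched as a timeout check on the last message seen from each peer. This is an illustration of the general technique only, not the content of the attached patch; the class name `HeartbeatMonitor`, its methods and the peer names are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per peer and report peers that went silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable so the logic can be tested
        self._last_seen = {}

    def beat(self, peer):
        """Record that a heartbeat from peer has just arrived."""
        self._last_seen[peer] = self._clock()

    def silent_peers(self):
        """Return the sorted peers whose last heartbeat is older than timeout_s."""
        now = self._clock()
        return sorted(peer for peer, seen in self._last_seen.items()
                      if now - seen > self.timeout_s)
```

A SIGSTOPed broker stops sending heartbeats while its connections stay open, so it would show up in `silent_peers()` after `timeout_s` seconds even though no socket error is ever reported.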

Attachments

ha-heartbeat.diff, 23/Apr/14 19:24, 14 kB, Alan Conway

People

Assignee: Alan Conway

Reporter: Alan Conway

Votes: 0

Watchers: 3

Dates

Created: 23/Apr/14 19:08

Updated: 26/Sep/14 15:43

Resolved: 18/Jul/14 20:18