Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version: 0.28
- Component: None
Description
Frantisek Reznicek 2014-07-09 08:59:30 EDT
Description of problem:
A qpid HA cluster may end up with all brokers in the joining state after the HA primary is killed.
Test scenario:
Take a 3-node qpid HA cluster with all three nodes operational.
A sender is then executed, sending to a queue (purely transactional, with durable messages and a durable queue address).
During that process the primary broker is killed multiple times.
After the N'th primary broker kill the cluster is no longer functional, as all qpid brokers end up in the joining state:
[root@dhcp-lab-216 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining
[root@dhcp-x-216 ~]# clustat
Cluster Status for dtests_ha @ Wed Jul 9 14:38:44 2014
Member Status: Quorate
Member Name                    ID   Status
------ ----                    ---- ------
192.168.6.60                      1 Online, Local, rgmanager
192.168.6.61                      2 Online, rgmanager
192.168.6.62                      3 Online, rgmanager

Service Name                   Owner (Last)                   State
------- ----                   ----- ------                   -----
service:qpidd_1                192.168.6.60                   started
service:qpidd_2                192.168.6.61                   started
service:qpidd_3                192.168.6.62                   started
service:qpidd_primary          (192.168.6.62)                 stopped
[root@dhcp-x-165 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining
[root@dhcp-x-218 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining
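For test automation, the stuck state can be detected directly from the `qpid-ha status --all` output shown above. Below is a minimal sketch; it assumes only the "host:port state" line format printed above, and the broker address is a placeholder.

import subprocess

def all_joining(broker="192.168.6.60:5672"):
    """Return True if every broker listed by `qpid-ha status --all` reports 'joining'."""
    out = subprocess.check_output(["qpid-ha", "status", "--all", "-b", broker])
    states = [line.split()[-1] for line in out.decode().splitlines() if line.strip()]
    return bool(states) and all(state == "joining" for state in states)

if __name__ == "__main__":
    print("stuck: all brokers joining" if all_joining() else "not all joining")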
I believe the key to hitting the issue is to kill the newly promoted primary soon after it starts appearing in the starting/started state in clustat.
My current understanding is that with a 3-node cluster, any failure applied to a single node at a time should be handled by HA. This is what the testing scenario does:
A    B    C    (nodes)
pri  bck  bck
kill
bck  pri  bck
kill
bck  bck  pri
kill
...
pri  bck  bck
kill
bck  bck  bck
It looks to me that there is a short window while a new primary is being promoted during which killing that newly promoted primary causes the promotion procedure to get stuck with all brokers in joining.
I haven't seen such behavior in the past; either we are now more sensitive to this case (after the -STOP case fixes) or enabling durability sharply raises the probability.
Version-Release number of selected component (if applicable):
- rpm -qa | grep qpid | sort
perl-qpid-0.22-13.el6.i686
perl-qpid-debuginfo-0.22-13.el6.i686
python-qpid-0.22-15.el6.noarch
python-qpid-proton-doc-0.5-9.el6.noarch
python-qpid-qmf-0.22-33.el6.i686
qpid-cpp-client-0.22-42.el6.i686
qpid-cpp-client-devel-0.22-42.el6.i686
qpid-cpp-client-devel-docs-0.22-42.el6.noarch
qpid-cpp-client-rdma-0.22-42.el6.i686
qpid-cpp-debuginfo-0.22-42.el6.i686
qpid-cpp-server-0.22-42.el6.i686
qpid-cpp-server-devel-0.22-42.el6.i686
qpid-cpp-server-ha-0.22-42.el6.i686
qpid-cpp-server-linearstore-0.22-42.el6.i686
qpid-cpp-server-rdma-0.22-42.el6.i686
qpid-cpp-server-xml-0.22-42.el6.i686
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-2.el6.i686
qpid-proton-c-devel-0.7-2.el6.i686
qpid-proton-c-devel-doc-0.5-9.el6.noarch
qpid-proton-debuginfo-0.7-2.el6.i686
qpid-qmf-0.22-33.el6.i686
qpid-qmf-debuginfo-0.22-33.el6.i686
qpid-qmf-devel-0.22-33.el6.i686
qpid-snmpd-1.0.0-16.el6.i686
qpid-snmpd-debuginfo-1.0.0-16.el6.i686
qpid-tests-0.22-15.el6.noarch
qpid-tools-0.22-13.el6.noarch
ruby-qpid-qmf-0.22-33.el6.i686
How reproducible:
Rarely; timing is the key.
Steps to Reproduce:
1. have a 3-node cluster configured
2. start the whole cluster up
3. execute a transactional sender to a durable queue address with durable messages and reconnect enabled (see the sketch after this list)
4. repeatedly kill the primary broker once it is promoted
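The sender in step 3 can be approximated as below. This is only a sketch of the kind of client used, assuming the python-qpid (qpid.messaging) API; the queue name, broker addresses, and message count are placeholders.

# Sketch of the step-3 sender: transactional, durable messages, durable queue,
# reconnect enabled. Queue name, broker addresses and message count are placeholders.
from qpid.messaging import Connection, Message

conn = Connection("192.168.6.60:5672",
                  reconnect=True,
                  reconnect_urls=["192.168.6.61:5672", "192.168.6.62:5672"])
conn.open()
try:
    session = conn.session(transactional=True)
    # durable queue address
    sender = session.sender("test_queue; {create: always, node: {durable: True}}")
    for i in range(100000):
        sender.send(Message(content="msg-%d" % i, durable=True))
        session.commit()  # one durable message per transaction
finally:
    conn.close()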
Actual results:
After a few kills the cluster ends up non-functional, with all brokers in joining. In other words, qpid HA can be brought down by inserting single, isolated failures into brokers that have just been promoted.
Expected results:
Qpid HA should tolerate a single failure at a time.
Additional info:
Details on failure insertion (a hedged sketch of such a loop is shown after this list):
- kill -9 `pidof qpidd` is the failure action
- let T1 denote the duration between failure insertion and the new primary being ready to serve
- the failure insertion period T2 > T1, i.e. no cumulative failures are inserted while HA is working through the promotion of a new primary
-> this fact (in my view) proves that there is a real issue
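A sketch of the failure-insertion loop described above: it waits until some broker is actually promoted (reports "active") before issuing the next kill, which keeps T2 > T1 so failures are never cumulative. It assumes the qpid-ha CLI and password-less ssh to the nodes; the node list is a placeholder.

# Wait until one broker is promoted, then kill -9 only that fresh primary.
import subprocess
import time

NODES = ["192.168.6.60", "192.168.6.61", "192.168.6.62"]

def ha_state(node):
    """Return the broker's HA state ('joining', 'ready', 'active', ...) or None."""
    try:
        out = subprocess.check_output(["qpid-ha", "status", "-b", "%s:5672" % node])
        return out.decode().strip().split()[-1]
    except subprocess.CalledProcessError:
        return None

def wait_for_primary(timeout=120):
    """Poll until one broker reports 'active' (promoted primary); return its node."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        for node in NODES:
            if ha_state(node) == "active":
                return node
        time.sleep(1)
    return None

while True:
    primary = wait_for_primary()
    if primary is None:
        print("no broker promoted within timeout - likely stuck in all joining")
        break
    # single isolated failure against the newly promoted primary
    subprocess.call(["ssh", "root@%s" % primary, "kill -9 `pidof qpidd`"])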
Issue Links
- duplicates: QPID-5942 qpid HA cluster may end-up in joining state after HA primary is killed (Closed)