[CURATOR-653] Double leader for LeaderLatch - ASF JIRA

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 5.4.0
Component/s: Recipes
Labels: None

Description

Reported by @woaishixiaoxiao:

When I use LeaderLatch to elect a leader, two clients can end up as leader at the same time.
The timeline is as follows:
1. The ZooKeeper cluster switches its leader node because of a zxid overflow. During the switch the cluster is unavailable to clients.
2. Client A (not the leader before the zxid overflow) and client B (the leader before the zxid overflow) enter the suspended state. Client B sets its leader status to false.
3. The ZooKeeper cluster completes its leader election and returns to normal.
4. Client A enters the reconnected state and calls the reset function, which sets its leader status to false.
5. Client B enters the reconnected state and calls the reset function, which sets its leader status to false and deletes its old path.
6. Client A receives the preNodeDeleteEvent and calls getChildren on the ZooKeeper server. It finds that its own node has the smallest sequence number and sets itself as leader.
7. Client B creates a new ephemeral node and calls getChildren on the ZooKeeper server. It finds that its node does not have the smallest sequence number, so it watches for the delete event of the preceding node.
8. Client A deletes its old path.
9. Client B receives the preNodeDeleteEvent and calls getChildren on the ZooKeeper server. It finds that its own node has the smallest sequence number and sets itself as leader.
10. Client A creates a new ephemeral node and calls getChildren on the ZooKeeper server. It finds that its node does not have the smallest sequence number, so it watches for the delete event of the preceding node. However, it does not set itself to the non-leader state, so because of step 6 client A is still in the leader state.
11. Client A and client B are now both leader at the same time.
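
The interleaving above can be replayed with a small in-memory model. This is only a sketch for illustration: it does not use Curator or ZooKeeper, and the names `DoubleLeaderSketch`, `Client`, `checkLeadership`, `replay` and the `resetWhenNotFirst` flag are all hypothetical. The flag shows one possible guard (clearing the local flag whenever the node is not the smallest); it is an assumption for comparison and is not claimed to be the change made in the linked pull request.

```java
import java.util.TreeSet;

public class DoubleLeaderSketch {
    /** Hypothetical stand-in for one latch participant (not Curator's LeaderLatch). */
    static final class Client {
        boolean leader;   // local leadership flag
        int node = -1;    // sequence number of our ephemeral node, -1 if none

        // Models the check done after getChildren: are we the smallest node?
        void checkLeadership(TreeSet<Integer> children, boolean resetWhenNotFirst) {
            if (node >= 0 && !children.isEmpty() && children.first() == node) {
                leader = true;
            } else if (resetWhenNotFirst) {
                leader = false; // assumed guard: drop leadership when not first
            }
            // Otherwise only a watch on the preceding node is set; the flag is untouched.
        }
    }

    /** Replays steps 2 to 10 of the timeline; returns {A.leader, B.leader}. */
    static boolean[] replay(boolean resetWhenNotFirst) {
        TreeSet<Integer> children = new TreeSet<>();
        int nextSeq = 0;

        // Before the outage: B holds the smallest node and is leader, A is behind it.
        Client b = new Client();
        b.node = nextSeq++; children.add(b.node); b.leader = true;
        Client a = new Client();
        a.node = nextSeq++; children.add(a.node);

        b.leader = false;                                   // step 2: B suspended
        a.leader = false;                                   // step 4: A reset, old node not yet deleted
        b.leader = false; children.remove(b.node); b.node = -1; // step 5: B reset, old path deleted
        a.checkLeadership(children, resetWhenNotFirst);     // step 6: A's old node is now smallest
        b.node = nextSeq++; children.add(b.node);           // step 7: B creates a new node
        b.checkLeadership(children, resetWhenNotFirst);
        children.remove(a.node); a.node = -1;               // step 8: A deletes its old path
        b.checkLeadership(children, resetWhenNotFirst);     // step 9: B is now smallest
        a.node = nextSeq++; children.add(a.node);           // step 10: A creates a new node
        a.checkLeadership(children, resetWhenNotFirst);

        return new boolean[] {a.leader, b.leader};
    }

    public static void main(String[] args) {
        boolean[] unguarded = replay(false);
        boolean[] guarded = replay(true);
        System.out.println("without guard: A=" + unguarded[0] + " B=" + unguarded[1]);
        System.out.println("with guard:    A=" + guarded[0] + " B=" + guarded[1]);
    }
}
```

Without the guard, the model ends with both flags set, matching step 11. With the assumed guard, client A's flag is cleared in step 10 and only client B remains leader.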

Issue Links

supersedes

CURATOR-444 LeaderLatch sends events that leads to simultaneously leadership after blocking zookeeper peer communication

Closed

Testing discovered

CURATOR-657 TestPathChildrenCache timed out in Java11 test run

Open

links to

GitHub Pull Request #436

People

Assignee: Zili Chen

Reporter: Zili Chen

Votes: 0

Watchers: 1

Dates

Created: 27/Sep/22 03:16

Updated: 11/Aug/24 14:38

Resolved: 18/Oct/22 10:09

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 40m