[FLINK-10052] Tolerate temporarily suspended ZooKeeper connections - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.2, 1.5.2, 1.6.0, 1.8.1
Fix Version/s: 1.14.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

This issue results from FLINK-10011, which uncovered a problem with Flink's HA recovery and proposed the following solution to harden Flink:

The ZooKeeperLeaderElectionService uses Curator's LeaderLatch recipe for leader election. The leader latch revokes leadership as soon as the ZooKeeper connection is suspended. This is premature if the system can reconnect to ZooKeeper before its session expires. As a result of the lost leadership, all jobs are canceled and then restarted immediately after leadership is regained.

Instead of revoking leadership immediately upon a SUSPENDED ZooKeeper connection, it would be better to wait until the ZooKeeper connection is LOST. That way the system could reconnect without losing leadership. This could be achieved by using Curator's LeaderSelector instead of the LeaderLatch.
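The proposed behavior can be illustrated with a minimal, dependency-free sketch. It is not the change merged for this issue and it does not use Curator itself: `ZkConnectionState` is a hypothetical stand-in for the connection states named above, and `TolerantLeadership` is a hypothetical helper invented for illustration. It only shows the intended rule: keep leadership while the connection is SUSPENDED and revoke it only once the connection is LOST.

```java
// Hypothetical stand-in for the ZooKeeper connection states discussed above,
// defined locally so the sketch runs without Curator on the classpath.
enum ZkConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical helper (not a Flink or Curator class) that tracks leadership
// and tolerates temporarily suspended connections.
final class TolerantLeadership {
    private boolean leader;
    private boolean suspended;

    // Called when the election grants leadership to this contender.
    void grantLeadership() {
        leader = true;
    }

    // Called for every connection state change.
    void onConnectionStateChanged(ZkConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // Connection is interrupted but the session may still be alive:
                // remember the suspension, keep leadership.
                suspended = true;
                break;
            case CONNECTED:
            case RECONNECTED:
                // Connection is back; leadership is unchanged by this event.
                suspended = false;
                break;
            case LOST:
                // Connection is considered lost: only now revoke leadership.
                suspended = false;
                leader = false;
                break;
        }
    }

    boolean isLeader() {
        return leader;
    }

    boolean isSuspended() {
        return suspended;
    }
}
```

With this rule, a SUSPENDED followed by RECONNECTED leaves leadership untouched, so no jobs would need to be canceled and restarted. After LOST, a later RECONNECTED does not restore leadership in this sketch; it would have to be granted again by the election.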

Issue Links

is duplicated by
- FLINK-13189 Fix the impact of zookeeper network disconnect temporarily on flink long running jobs (Closed)
- FLINK-14111 Flink should be robust to a non-leader Zookeeper host going down (Closed)

relates to
- FLINK-10011 Old job resurrected during HA failover (Resolved)

links to
- GitHub Pull Request #9158
- GitHub Pull Request #11338
- GitHub Pull Request #15675
- GitHub Pull Request #16801

People

Assignee: Till Rohrmann
Reporter: Till Rohrmann
Votes: 7
Watchers: 38

Dates

Created: 03/Aug/18 17:31
Updated: 02/Sep/22 10:03
Resolved: 15/Aug/21 09:54

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 50m