Flink / FLINK-10052

Tolerate temporarily suspended ZooKeeper connections

Details

    Description

      This issue results from FLINK-10011, which uncovered a problem with Flink's HA recovery and proposed the following solution to harden Flink:

      The ZooKeeperLeaderElectionService uses Curator's LeaderLatch recipe for leader election. The leader latch revokes leadership as soon as the ZooKeeper connection is suspended. This can be premature if the system is able to reconnect to ZooKeeper before its session expires. As a consequence of the lost leadership, all jobs are canceled and then restarted immediately after leadership is regained.

      Instead of revoking leadership immediately upon a SUSPENDED ZooKeeper connection, it would be better to wait until the connection is LOST. That way, the system would have a chance to reconnect without losing leadership. This could be achieved by using Curator's LeaderSelector instead of the LeaderLatch.
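
The difference between the two behaviors can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `ConnState`, `Policy`, and `countRevocations` are hypothetical names made up for illustration and are not part of Curator or Flink. The model also assumes that the contender starts as leader and wins the election again as soon as it reconnects.

```java
import java.util.Arrays;
import java.util.List;

class SuspendedConnectionDemo {

    // Hypothetical stand-in for the connection states named in this issue;
    // this is NOT Curator's ConnectionState class.
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Hypothetical policies: the behavior described above vs. the proposed one.
    enum Policy { REVOKE_ON_SUSPENDED, REVOKE_ON_LOST }

    // Replays connection events for a contender that starts as leader and
    // counts how often leadership is revoked under the given policy.
    static int countRevocations(Policy policy, List<ConnState> events) {
        boolean leader = true;
        int revocations = 0;
        for (ConnState event : events) {
            boolean revoke = event == ConnState.LOST
                    || (policy == Policy.REVOKE_ON_SUSPENDED
                            && event == ConnState.SUSPENDED);
            if (revoke && leader) {
                leader = false;
                revocations++;
            } else if (!leader && (event == ConnState.RECONNECTED
                    || event == ConnState.CONNECTED)) {
                // Simplifying assumption: leadership is regained on reconnect.
                leader = true;
            }
        }
        return revocations;
    }

    public static void main(String[] args) {
        // Short interruption: the client reconnects before the session expires.
        List<ConnState> blip =
                Arrays.asList(ConnState.SUSPENDED, ConnState.RECONNECTED);
        // Real outage: the session expires before the client reconnects.
        List<ConnState> outage = Arrays.asList(
                ConnState.SUSPENDED, ConnState.LOST, ConnState.RECONNECTED);

        for (Policy policy : Policy.values()) {
            System.out.println(policy
                    + " blip=" + countRevocations(policy, blip)
                    + " outage=" + countRevocations(policy, outage));
        }
    }
}
```

In this model, every revocation corresponds to the job cancellation and restart described above. Waiting for LOST avoids the revocation for the short interruption and still revokes leadership once the session has actually expired.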

            People

              trohrmann Till Rohrmann
              Votes: 7
              Watchers: 38

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 50m