Kafka / KAFKA-4418

Broker Leadership Election Fails If Missing ZK Path Raises Exception

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0.1, 0.10.0.0, 0.10.0.1
    • Fix Version/s: None
    • Component/s: zkclient

    Description

      Our Kafka cluster went down after a single node failed while a ZooKeeper path was missing for one topic (/brokers/topics/<topicname>/partitions). In that state, leadership election could not run and failed with a stack trace like this:

      Failed to start preferred replica election
      org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
      at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
      at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
      at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
      at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:537)
      at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:817)
      at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:816)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
      at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
      at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      at scala.collection.AbstractTraversable.map(Traversable.scala:105)
      at kafka.utils.ZkUtils.getAllPartitions(ZkUtils.scala:816)
      at kafka.admin.PreferredReplicaLeaderElectionCommand$.main(PreferredReplicaLeaderElectionCommand.scala:64)
      at kafka.admin.PreferredReplicaLeaderElectionCommand.main(PreferredReplicaLeaderElectionCommand.scala)
      Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
      at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
      at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)
      at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)
      at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
      ... 16 more

      I have looked through the code and found a place where a quick fix would seem to allow the leadership election to continue. Specifically, the function at https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/utils/ZkUtils.scala#L633 does not handle possible exceptions. Wrapping the call in a try/catch block would work, but could introduce other problems:

      • If the function is used elsewhere, a caller higher up might need the exception in order to stop some other operation from proceeding.
      • Unless the exception is logged or reported somehow, no one will know the problem exists, which makes debugging other problems harder.

      I'm sure there are other issues I'm not aware of, but those two come to mind quickly. What would be the best route for getting this resolved quickly?
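
      One possible shape for such a fix, sketched here in plain Java rather than against the real ZkUtils code: keep the strict lookup for callers that need the exception, and add a lenient variant that logs the missing path and treats it as having no children. Everything in this sketch is hypothetical. `SafePartitionLookup`, `ChildLister`, `NoNodeException`, and `getChildrenOrEmpty` are stand-ins invented for illustration, not Kafka or zkclient APIs, and whether an empty partition list is a safe result for every caller is an open question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch: none of these types are Kafka's or zkclient's real API.
class SafePartitionLookup {
    private static final Logger LOG = Logger.getLogger("SafePartitionLookup");

    // Stand-in for org.I0Itec.zkclient.exception.ZkNoNodeException.
    static class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    // Stand-in for a client call that lists the children of a ZooKeeper path.
    interface ChildLister {
        List<String> getChildren(String path);
    }

    // Strict variant: the exception still propagates to callers that rely on it.
    static List<String> getChildren(ChildLister zk, String path) {
        return zk.getChildren(path);
    }

    // Lenient variant: log the missing path, then treat it as having no children.
    static List<String> getChildrenOrEmpty(ChildLister zk, String path) {
        try {
            return zk.getChildren(path);
        } catch (NoNodeException e) {
            LOG.warning("Missing ZooKeeper path " + path + "; treating it as empty");
            return new ArrayList<>();
        }
    }

    // Collect partitions per topic so one broken topic does not abort the scan.
    static Map<String, List<String>> getAllPartitions(ChildLister zk, List<String> topics) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String topic : topics) {
            String path = "/brokers/topics/" + topic + "/partitions";
            result.put(topic, getChildrenOrEmpty(zk, path));
        }
        return result;
    }
}
```

      With this split, the election path would call the lenient variant and skip the broken topic with a warning in the log, while any other caller that depends on the exception keeps using the strict one.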

          People

            Assignee: Unassigned
            Reporter: Michael Pedersen (pedersen)
            Votes: 0
            Watchers: 2
