KAFKA-10250

Kafka broker shrinks the ISRs and disconnects from other brokers for a few seconds


Details

    Description

      The following is a summary/overview of the whole issue. Can you please help us look into the details below and let us know your thoughts on what caused the issue, and how to mitigate it in the future?

      Issue:
      1. All of a sudden, all the other brokers in the cluster report issues with one of the brokers, as shown below (a config-inspection sketch follows these log excerpts)

      Error for good broker [broker: broker2, brokerID: 1030]
      [2020-06-27 16:14:53,367] WARN [ReplicaFetcher replicaId=1030, leaderId=1029, fetcherId=13] Error in response for fetch request (type=FetchRequest, replicaId=1030, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1901394956, epoch=1018699)) (kafka.server.ReplicaFetcherThread)
      java.io.IOException: Connection to 1029 was disconnected before the response was read
      at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
      at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
      at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
      at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
      at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
      at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
      at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

      Error from affected broker [broker: broker6, brokerID: 1029]
      [2020-06-27 16:14:25,744] INFO [Partition topic_test2-33 broker=1029] Shrinking ISR from 1025,1029,1030 to 1025,1029 (kafka.cluster.Partition)
      [2020-06-27 16:14:25,752] INFO [Partition topic_a_restate-34 broker=1029] Shrinking ISR from 1027,1029,1028 to 1029 (kafka.cluster.Partition)
      [2020-06-27 16:14:25,760] INFO [Partition topic_f_restate-39 broker=1029] Shrinking ISR from 1026,1029,1025 to 1029,1025 (kafka.cluster.Partition)
      [2020-06-27 16:14:25,772] INFO [Partition topic_test2-16 broker=1029] Shrinking ISR from 1028,1029,1030 to 1029,1030 (kafka.cluster.Partition)
      [2020-06-27 16:14:26,683] INFO [ProducerStateManager partition=topic_abc_f-21] Writing producer snapshot at offset 1509742173 (kafka.log.ProducerStateManager)
      [2020-06-27 16:14:26,683] INFO [Log partition=topic_abc_f-21, dir=/hadoop-e/kafka/data1] Rolled new log segment at offset 1509742173 in 1 ms. (kafka.log.Log)
      [2020-06-27 16:14:55,375] INFO [Partition topic_test2-33 broker=1029] Expanding ISR from 1025,1029 to 1025,1029,1030 (kafka.cluster.Partition)
      [2020-06-27 16:14:55,387] INFO [Partition test-topic-analysis-9 broker=1029] Expanding ISR from 1028,1029 to 1028,1029,1030 (kafka.cluster.Partition)
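
      The shrink/expand pattern above is what a leader does when a follower stops fetching for longer than the leader-side lag limit: broker 1029, as leader, drops followers such as 1030 from the ISR once they have not caught up within replica.lag.time.max.ms, and adds them back when they catch up again. Below is a minimal sketch for reading the effective value of that setting on the affected broker via the Java AdminClient; the class name and bootstrap address are placeholders, and broker id 1029 is taken from the logs above.

      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.admin.AdminClient;
      import org.apache.kafka.clients.admin.AdminClientConfig;
      import org.apache.kafka.clients.admin.Config;
      import org.apache.kafka.common.config.ConfigResource;

      public class CheckReplicaLagConfig {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              // Placeholder bootstrap address; point this at any live broker in the cluster.
              props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker2:9092");

              try (AdminClient admin = AdminClient.create(props)) {
                  // Broker id 1029 is the affected broker from the logs above.
                  ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1029");
                  Config config = admin.describeConfigs(Collections.singleton(broker))
                                       .all().get().get(broker);
                  // replica.lag.time.max.ms: how long a follower may go without catching up
                  // to the leader's log end offset before the leader shrinks the ISR.
                  System.out.println(config.get("replica.lag.time.max.ms").value());
              }
          }
      }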

      2. The entire Kafka cluster becomes stable within a few minutes

      Trace for good broker [broker: broker2, brokerID: 1030]
      [2020-06-27 16:20:14,861] INFO Deleted time index /hadoop-g/kafka/data1/topic-analysis-0/00000000009100172512.timeindex.deleted. (kafka.log.LogSegment)
      [2020-06-27 16:20:14,882] INFO [Log partition=topic-analysis-4, dir=/hadoop-b/kafka/data1] Deleting segment 9100010843 (kafka.log.Log)
      [2020-06-27 16:20:14,883] INFO Deleted log /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.log.deleted. (kafka.log.LogSegment)
      [2020-06-27 16:20:14,897] INFO Deleted offset index /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.index.deleted. (kafka.log.LogSegment)
      [2020-06-27 16:20:14,897] INFO Deleted time index /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.timeindex.deleted. (kafka.log.LogSegment)

      Trace from affected broker [broker: broker6, brokerID: 1029]
      [2020-06-27 16:21:01,552] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9099830344, size 1073733482] for deletion. (kafka.log.Log)
      [2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9100082067, size 1073738790] for deletion. (kafka.log.Log)
      [2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Incrementing log start offset to 9100334713 (kafka.log.Log)
      [2020-06-27 16:21:01,581] INFO Cleared earliest 0 entries from epoch cache based on passed offset 9100334713 leaving 1 in EpochFile for partition topic-analysis-22 (kafka.server.epoch.LeaderEpochFileCache)
      [2020-06-27 16:22:00,175] INFO [Log partition=topic_def_c-1, dir=/hadoop-j/kafka/data1] Deleting segment 1628853412 (kafka.log.Log)
      [2020-06-27 16:22:00,175] INFO Deleted log /hadoop-j/kafka/data1/topic_def_c-1/00000000001628853412.log.deleted. (kafka.log.LogSegment)

      3. Once the disconnect happens, all the producer jobs fail with "Caused by: org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for topic-analysis-10" (a producer-settings sketch follows this list)
      a. If we describe the topics, all the topic partitions the jobs complain about have their leader on the affected broker [in this case, topic-analysis-10 has its leader on broker 1029] (see the describe sketch after this list)
      4. This continues until the affected broker is restarted. Once the broker is restarted, all the jobs run fine
      5. During the time between the broker getting affected and the restart:
      a. We see a lot of ISR shrinks and expansions happening, as per Grafana. These ISR expansions/shrinks are observed until the affected broker is restarted, but the shrinks/expands cannot be seen in the logs.
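
      For item 3, the expired-batch error means records waited in the producer longer than its timeout while metadata for partitions led by broker 1029 could not be refreshed. The sketch below shows producer settings that give batches more headroom to ride out a short broker stall; it is illustrative only, the values and bootstrap addresses are assumptions, and delivery.timeout.ms only exists on 2.1+ clients (older clients expire batches based on request.timeout.ms).

      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerConfig;
      import org.apache.kafka.common.serialization.StringSerializer;

      public class ResilientProducerConfig {
          public static void main(String[] args) {
              Properties props = new Properties();
              // Placeholder bootstrap addresses.
              props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker2:9092,broker6:9092");
              props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
              props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

              // Upper bound on the total time a record may spend in the producer
              // (batching + retries) before a TimeoutException is raised.
              props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 300000);
              // Per-request timeout and retries within that budget.
              props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
              props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
              // How long send() may block waiting for metadata before failing.
              props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 120000);

              try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                  // producer.send(new ProducerRecord<>(...)) as usual
              }
          }
      }

      These settings only mask a short stall; if the leader stays unresponsive until a restart, as described in item 4, producers will still time out eventually.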
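
      For item 3a, partition leadership and ISR can be confirmed programmatically as well as with the topic describe tooling. Below is a minimal sketch using the Java AdminClient; the class name and bootstrap address are placeholders, and topic-analysis-10 is the topic from the producer error above.

      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.admin.AdminClient;
      import org.apache.kafka.clients.admin.AdminClientConfig;
      import org.apache.kafka.clients.admin.TopicDescription;
      import org.apache.kafka.common.TopicPartitionInfo;

      public class DescribeTopicLeaders {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              // Placeholder bootstrap address.
              props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker2:9092");

              try (AdminClient admin = AdminClient.create(props)) {
                  TopicDescription desc = admin.describeTopics(Collections.singleton("topic-analysis-10"))
                                               .all().get().get("topic-analysis-10");
                  for (TopicPartitionInfo p : desc.partitions()) {
                      // Partitions whose leader is the affected broker (1029) are the ones
                      // the producer jobs time out on.
                      System.out.printf("partition=%d leader=%s isr=%s%n",
                              p.partition(), p.leader(), p.isr());
                  }
              }
          }
      }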

          People

            Assignee: Unassigned
            Reporter: Rishank Chandra Puram