KAFKA-10250

Kafka broker shrinks the ISRs and disconnects from other brokers for a few seconds


Details

    Description

      The following is a summary/overview of the whole issue. Can you please help us look into the details below and let us know your thoughts on what caused the issue, and how to mitigate it in the future?

      Issue:
      1. All of a sudden, all the other brokers in the cluster report issues with one of the brokers, as shown below (a config-inspection sketch follows these log excerpts)

      Error for good broker [broker: broker2, brokerID: 1030]
      [2020-06-27 16:14:53,367] WARN [ReplicaFetcher replicaId=1030, leaderId=1029, fetcherId=13] Error in response for fetch request (type=FetchRequest, replicaId=1030, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1901394956, epoch=1018699)) (kafka.server.ReplicaFetcherThread)
      java.io.IOException: Connection to 1029 was disconnected before the response was read
      at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
      at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
      at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
      at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
      at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
      at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
      at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

      Error from affected broker [broker: broker6, brokerID: 1029]
      [2020-06-27 16:14:25,744] INFO [Partition topic_test2-33 broker=1029] Shrinking ISR from 1025,1029,1030 to 1025,1029 (kafka.cluster.Partition)
      [2020-06-27 16:14:25,752] INFO [Partition topic_a_restate-34 broker=1029] Shrinking ISR from 1027,1029,1028 to 1029 (kafka.cluster.Partition)
      [2020-06-27 16:14:25,760] INFO [Partition topic_f_restate-39 broker=1029] Shrinking ISR from 1026,1029,1025 to 1029,1025 (kafka.cluster.Partition)
      [2020-06-27 16:14:25,772] INFO [Partition topic_test2-16 broker=1029] Shrinking ISR from 1028,1029,1030 to 1029,1030 (kafka.cluster.Partition)
      [2020-06-27 16:14:26,683] INFO [ProducerStateManager partition=topic_abc_f-21] Writing producer snapshot at offset 1509742173 (kafka.log.ProducerStateManager)
      [2020-06-27 16:14:26,683] INFO [Log partition=topic_abc_f-21, dir=/hadoop-e/kafka/data1] Rolled new log segment at offset 1509742173 in 1 ms. (kafka.log.Log)
      [2020-06-27 16:14:55,375] INFO [Partition topic_test2-33 broker=1029] Expanding ISR from 1025,1029 to 1025,1029,1030 (kafka.cluster.Partition)
      [2020-06-27 16:14:55,387] INFO [Partition test-topic-analysis-9 broker=1029] Expanding ISR from 1028,1029 to 1028,1029,1030 (kafka.cluster.Partition)
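
      The shrink/expand pattern above is what a leader does when a follower stops fetching for longer than the leader-side lag limit: broker 1029, as leader, drops followers such as 1030 from the ISR once they have not caught up within replica.lag.time.max.ms, and adds them back when they catch up again. Below is a minimal sketch for reading the effective value of that setting on the affected broker via the Java AdminClient; the class name and bootstrap address are placeholders, and broker id 1029 is taken from the logs above.

      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.admin.AdminClient;
      import org.apache.kafka.clients.admin.AdminClientConfig;
      import org.apache.kafka.clients.admin.Config;
      import org.apache.kafka.common.config.ConfigResource;

      public class CheckReplicaLagConfig {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              // Placeholder bootstrap address; point this at any live broker in the cluster.
              props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker2:9092");

              try (AdminClient admin = AdminClient.create(props)) {
                  // Broker id 1029 is the affected broker from the logs above.
                  ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1029");
                  Config config = admin.describeConfigs(Collections.singleton(broker))
                                       .all().get().get(broker);
                  // replica.lag.time.max.ms: how long a follower may go without catching up
                  // to the leader's log end offset before the leader shrinks the ISR.
                  System.out.println(config.get("replica.lag.time.max.ms").value());
              }
          }
      }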

      2. The entire Kafka cluster becomes stable within a few minutes

      Trace for good broker [broker: broker2, brokerID: 1030]
      [2020-06-27 16:20:14,861] INFO Deleted time index /hadoop-g/kafka/data1/topic-analysis-0/00000000009100172512.timeindex.deleted. (kafka.log.LogSegment)
      [2020-06-27 16:20:14,882] INFO [Log partition=topic-analysis-4, dir=/hadoop-b/kafka/data1] Deleting segment 9100010843 (kafka.log.Log)
      [2020-06-27 16:20:14,883] INFO Deleted log /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.log.deleted. (kafka.log.LogSegment)
      [2020-06-27 16:20:14,897] INFO Deleted offset index /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.index.deleted. (kafka.log.LogSegment)
      [2020-06-27 16:20:14,897] INFO Deleted time index /hadoop-b/kafka/data1/topic-analysis-4/00000000009100010843.timeindex.deleted. (kafka.log.LogSegment)

      Trace from affected broker [broker: broker6, brokerID: 1029]
      [2020-06-27 16:21:01,552] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9099830344, size 1073733482] for deletion. (kafka.log.Log)
      [2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Scheduling log segment [baseOffset 9100082067, size 1073738790] for deletion. (kafka.log.Log)
      [2020-06-27 16:21:01,553] INFO [Log partition=topic-analysis-22, dir=/hadoop-i/kafka/data1] Incrementing log start offset to 9100334713 (kafka.log.Log)
      [2020-06-27 16:21:01,581] INFO Cleared earliest 0 entries from epoch cache based on passed offset 9100334713 leaving 1 in EpochFile for partition topic-analysis-22 (kafka.server.epoch.LeaderEpochFileCache)
      [2020-06-27 16:22:00,175] INFO [Log partition=topic_def_c-1, dir=/hadoop-j/kafka/data1] Deleting segment 1628853412 (kafka.log.Log)
      [2020-06-27 16:22:00,175] INFO Deleted log /hadoop-j/kafka/data1/topic_def_c-1/00000000001628853412.log.deleted. (kafka.log.LogSegment)

      3. Once the disconnect happens, all the producer jobs fail with "Caused by: org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for topic-analysis-10" (a producer-settings sketch follows this list)
      a. If we describe the topics, all the topic partitions the jobs complain about have their leader on the affected broker [in this case, topic-analysis-10 has its leader on broker 1029] (see the describe sketch after this list)
      4. This continues until the affected broker is restarted. Once the broker is restarted, all the jobs run fine
      5. During the time between the broker getting affected and the restart:
      a. We see a lot of ISR shrinks and expansions happening, as per Grafana. These ISR expansions/shrinks are observed until the affected broker is restarted, but the shrinks/expands cannot be seen in the logs.
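
      For item 3, the expired-batch error means records waited in the producer longer than its timeout while metadata for partitions led by broker 1029 could not be refreshed. The sketch below shows producer settings that give batches more headroom to ride out a short broker stall; it is illustrative only, the values and bootstrap addresses are assumptions, and delivery.timeout.ms only exists on 2.1+ clients (older clients expire batches based on request.timeout.ms).

      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerConfig;
      import org.apache.kafka.common.serialization.StringSerializer;

      public class ResilientProducerConfig {
          public static void main(String[] args) {
              Properties props = new Properties();
              // Placeholder bootstrap addresses.
              props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker2:9092,broker6:9092");
              props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
              props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

              // Upper bound on the total time a record may spend in the producer
              // (batching + retries) before a TimeoutException is raised.
              props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 300000);
              // Per-request timeout and retries within that budget.
              props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
              props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
              // How long send() may block waiting for metadata before failing.
              props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 120000);

              try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                  // producer.send(new ProducerRecord<>(...)) as usual
              }
          }
      }

      These settings only mask a short stall; if the leader stays unresponsive until a restart, as described in item 4, producers will still time out eventually.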
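
      For item 3a, partition leadership and ISR can be confirmed programmatically as well as with the topic describe tooling. Below is a minimal sketch using the Java AdminClient; the class name and bootstrap address are placeholders, and topic-analysis-10 is the topic from the producer error above.

      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.admin.AdminClient;
      import org.apache.kafka.clients.admin.AdminClientConfig;
      import org.apache.kafka.clients.admin.TopicDescription;
      import org.apache.kafka.common.TopicPartitionInfo;

      public class DescribeTopicLeaders {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              // Placeholder bootstrap address.
              props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker2:9092");

              try (AdminClient admin = AdminClient.create(props)) {
                  TopicDescription desc = admin.describeTopics(Collections.singleton("topic-analysis-10"))
                                               .all().get().get("topic-analysis-10");
                  for (TopicPartitionInfo p : desc.partitions()) {
                      // Partitions whose leader is the affected broker (1029) are the ones
                      // the producer jobs time out on.
                      System.out.printf("partition=%d leader=%s isr=%s%n",
                              p.partition(), p.leader(), p.isr());
                  }
              }
          }
      }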

          People

            Assignee: Unassigned
            Reporter: Rishank Chandra Puram