Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- 2.4.0
Description
There seems to be a race condition that can now cause a rejoining member to get stuck indefinitely initiating a rejoin. The relevant client logs are attached (streams.log.tgz; all other attachments are broker logs). The client repeats this message (and nothing else) continuously until it is killed or shut down:
[2019-11-05 01:53:54,699] INFO [Consumer clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer, groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread. Initiating rejoin. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
The message that appears was added as part of the bugfix (PR 7460) for this related race condition: KAFKA-8104.
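To check a client log for this symptom, one option is to measure the longest uninterrupted run of the message. The following is a minimal sketch, not part of the Kafka tooling: the function name `longest_rejoin_run` is hypothetical, and it assumes the log has been extracted to plain text with one entry per line.

```python
from typing import Iterable

# Message text copied from the log line quoted in this report.
REJOIN_MESSAGE = "Generation data was cleared by heartbeat thread. Initiating rejoin."


def longest_rejoin_run(lines: Iterable[str]) -> int:
    """Return the length of the longest run of consecutive lines
    that contain REJOIN_MESSAGE (hypothetical helper for triage)."""
    longest = 0
    current = 0
    for line in lines:
        if REJOIN_MESSAGE in line:
            # Extend the current run of back-to-back rejoin messages.
            current += 1
            longest = max(longest, current)
        else:
            # Any other log line breaks the run.
            current = 0
    return longest
```

Passing an open file handle for the extracted client log to `longest_rejoin_run` returns the run length. A very long run with nothing else in between would be consistent with the stuck member described here, though what counts as "very long" is a judgment call.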
This issue was uncovered by the Streams version probing upgrade test, which fails with a varying frequency. Here is the rate of failures for different system test runs so far:
trunk (cooperative): 1/1 and 2/10 failures
2.4 (cooperative): 0/10 and 1/15 failures
trunk (eager): 0/10 failures
I've kicked off some high-repeat runs to complete overnight and hopefully shed more light.
Note that I have also kicked off runs of both 2.4 and trunk with the PR for KAFKA-8104 reverted. Both of them saw 2/10 failures, due to hitting the bug that was fixed by PR 7460. It is therefore unclear whether PR 7460 introduced a new race condition/bug, or merely uncovered an existing one that previously would have failed first due to KAFKA-8104.
Attachments
Issue Links
- is related to KAFKA-8104 Consumer cannot rejoin to the group after rebalancing (Resolved)