Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
aws-connector-3.0.0, 1.15.4, aws-connector-4.1.0, 1.16.2, 1.17.1
Description
Connector calls to AWS Kinesis Data Streams can hang indefinitely without making any progress.
We suspect the root cause to be related to the SDK handling of exceptions, similarly to what observed in FLINK-31675.
We identified this deadlock on applications running on AWS Kinesis Data Analytics using the AWS Kinesis Data Streams AsyncSink (with AWS SDK version 2.20.32 as per FLINK-31675). The deadlock scenario is still the same as described in FLINK-31675. However, the Netty content-length exception does not appear when using the updated SDK version.
This issue only occurs for applications and streams in the AWS regions ap-northeast-3 and us-gov-east-1. We did not observe this issue in any other AWS region.
The issue happens sporadically and unpredictably. As per its nature, we do not have instructions for reproducing it.
Example of flame-graphs observed when the issue occurs:
root
java.lang.Thread.run:829
org.apache.flink.runtime.taskmanager.Task.run:568
org.apache.flink.runtime.taskmanager.Task.doRun:746
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke:932
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring:953
org.apache.flink.runtime.taskmanager.Task$$Lambda$1253/0x0000000800ecbc40.run:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke:753
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop:804
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop:203
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$907/0x0000000800bf7840.runDefaultAction:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput:519
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput:65
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext:110
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext:159
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent:181
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier:231
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState:262
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$1586/0x00000008012c5c40.apply:-1
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2:234
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived:66
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint:74
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint:493
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100:64
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint:287
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint:147
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier:1198
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint:1241
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing:50
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1453/0x000000080128e840.run:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12:1253
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState:300
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier:89
org.apache.flink.streaming.runtime.operators.sink.SinkWriterOperator.prepareSnapshotPreBarrier:165
org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.flush:494
org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.yieldIfThereExistsInFlightRequests:503
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield:84
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take:149
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await:2211
java.util.concurrent.locks.LockSupport.parkNanos:234
jdk.internal.misc.Unsafe.park:-2