Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-32230

Deadlock in AWS Kinesis Data Streams AsyncSink connector

    XMLWordPrintableJSON

Details

    Description

      Connector calls to AWS Kinesis Data Streams can hang indefinitely without making any progress.

      We suspect the root cause to be related to the SDK handling of exceptions, similarly to what observed in FLINK-31675.

      We identified this deadlock on applications running on AWS Kinesis Data Analytics using the AWS Kinesis Data Streams AsyncSink (with AWS SDK version 2.20.32 as per FLINK-31675). The deadlock scenario is still the same as described in FLINK-31675. However, the Netty content-length exception does not appear when using the updated SDK version.

      This issue only occurs for applications and streams in the AWS regions ap-northeast-3 and us-gov-east-1. We did not observe this issue in any other AWS region.

      The issue happens sporadically and unpredictably. As per its nature, we do not have instructions for reproducing it.

      Example of flame-graphs observed when the issue occurs:

      root
      java.lang.Thread.run:829
      org.apache.flink.runtime.taskmanager.Task.run:568
      org.apache.flink.runtime.taskmanager.Task.doRun:746
      org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke:932
      org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring:953
      org.apache.flink.runtime.taskmanager.Task$$Lambda$1253/0x0000000800ecbc40.run:-1
      org.apache.flink.streaming.runtime.tasks.StreamTask.invoke:753
      org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop:804
      org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop:203
      org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$907/0x0000000800bf7840.runDefaultAction:-1
      org.apache.flink.streaming.runtime.tasks.StreamTask.processInput:519
      org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput:65
      org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext:110
      org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext:159
      org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent:181
      org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier:231
      org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState:262
      org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$1586/0x00000008012c5c40.apply:-1
      org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2:234
      org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived:66
      org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint:74
      org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint:493
      org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100:64
      org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint:287
      org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint:147
      org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier:1198
      org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint:1241
      org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing:50
      org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1453/0x000000080128e840.run:-1
      org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12:1253
      org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState:300
      org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier:89
      org.apache.flink.streaming.runtime.operators.sink.SinkWriterOperator.prepareSnapshotPreBarrier:165
      org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.flush:494
      org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.yieldIfThereExistsInFlightRequests:503
      org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield:84
      org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take:149
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await:2211
      java.util.concurrent.locks.LockSupport.parkNanos:234
      jdk.internal.misc.Unsafe.park:-2 

       

      Attachments

        Activity

          People

            antoniovespoli Antonio Vespoli
            antoniovespoli Antonio Vespoli
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: