Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-21263

Job hangs under backpressure

    XMLWordPrintableJSON

Details

    Description

      We have a flink job that runs fine for a few days but suddenly hangs and could never recover. Once we relanuch the job, the job runs fine. We detected the job has backpressure, but in all other cases, backpressure would only lead to slower consumption but what is wired here is the job made no progress at all. The symptoms looks similar with FLINK-20618

       

      About the job:
      1. Reads from Kafka and writes to Kafka

      2. version 1.11

      3. enabled unaligned checkpoint

       

      symptoms:

      1. All source/sink throughput drop to 0
      2. All checkpoint fails immediately after triggering.
      3. backpressure shows "high" from source to two downstream operators. 
      4. Flamegraph shows all subtask threads are in waiting
      5. Source jstack shows the Source thread is BLOCKED, as belows.
      Source: impression-reader -> impression-filter -> impression-data-conversion (1/60)
      Stack Trace is:
      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for <0x00000003a3e71330> (a java.util.concurrent.CompletableFuture$Signaller)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
      at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
      at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
      at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
      at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:293)
      at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:266)
      at org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:213)
      at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:294)
      at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103)
      at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.getBufferBuilder(ChannelSelectorRecordWriter.java:95)
      at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:135)
      at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.broadcastEmit(ChannelSelectorRecordWriter.java:80)
      at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitWatermark(RecordWriterOutput.java:121)
      at org.apache.flink.streaming.api.operators.CountingOutput.emitWatermark(CountingOutput.java:41)
      at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:570)
      at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:638)
      at org.apache.flink.streaming.api.operators.CountingOutput.emitWatermark(CountingOutput.java:41)
      at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:570)
      at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:638)
      at org.apache.flink.streaming.api.operators.CountingOutput.emitWatermark(CountingOutput.java:41)
      at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndEmitWatermark(StreamSourceContexts.java:315)
      at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.emitWatermark(StreamSourceContexts.java:425)
      - locked <0x00000006a485dab0> (a java.lang.Object)
      at org.apache.flink.streaming.connectors.kafka.internals.SourceContextWatermarkOutputAdapter.emitWatermark(SourceContextWatermarkOutputAdapter.java:37)
      at org.apache.flink.api.common.eventtime.WatermarkOutputMultiplexer.updateCombinedWatermark(WatermarkOutputMultiplexer.java:167)
      at org.apache.flink.api.common.eventtime.WatermarkOutputMultiplexer.onPeriodicEmit(WatermarkOutputMultiplexer.java:136)
      at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher$PeriodicWatermarkEmitter.onProcessingTime(AbstractFetcher.java:574)
      - locked <0x00000006a485dab0> (a java.lang.Object)
      at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:1220)
      at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$null$16(StreamTask.java:1211)
      at org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$590/1066788035.run(Unknown Source)
      at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:92)
      - locked <0x00000006a485dab0> (a java.lang.Object)
      at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
      at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:282)
      at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:190)
      at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
      at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:566)
      at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536)
      at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
      at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
      at java.lang.Thread.run(Thread.java:748)

       

       

      Attachments

        1. source_js1
          168 kB
          Lu Niu
        2. source_js2
          167 kB
          Lu Niu
        3. source_js3
          167 kB
          Lu Niu
        4. source_graph.svg
          406 kB
          Lu Niu

        Activity

          People

            Unassigned Unassigned
            qqibrow Lu Niu
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: