  1. Spark
  2. SPARK-36912

Get Result time for a task is very long and eventually times out

Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 3.0.3
    • Fix Version/s: None
    • Component/s: Block Manager
    • Labels: None

    Description

      We run Spark on Kubernetes for batch jobs that analyze flows and produce insights. The flows are read from a time-series database. We have 3 executor instances with 5g of memory each, plus a driver with 5g of memory. We observe the following warning, followed by timeout errors, after which the job fails. We have been stuck on this for some time and are really hoping to get some help from this forum:

      2021-10-02T16:07:09.459ZGMT WARN dispatcher-CoarseGrainedScheduler TaskSetManager - Stage 52 contains a task of very large size (2842 KiB). The maximum recommended task size is 1000 KiB.
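      To put the warning above in numbers, the following is a minimal sketch that pulls the stage ID, the reported task size, and the recommended limit out of the message and prints their ratio. The regular expression and the helper name `parse_task_size_warning` are hypothetical; they are written only against the message shape quoted in this report and are not part of Spark.

```python
import re

# Warning line copied verbatim from the report above.
WARNING = (
    "2021-10-02T16:07:09.459ZGMT WARN dispatcher-CoarseGrainedScheduler "
    "TaskSetManager - Stage 52 contains a task of very large size (2842 KiB). "
    "The maximum recommended task size is 1000 KiB."
)

# Hypothetical pattern: matches only the message shape shown in this report.
PATTERN = re.compile(
    r"Stage (?P<stage>\d+) contains a task of very large size "
    r"\((?P<size>\d+) KiB\)\. "
    r"The maximum recommended task size is (?P<limit>\d+) KiB"
)


def parse_task_size_warning(line):
    """Return (stage, size_kib, limit_kib), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match["stage"]), int(match["size"]), int(match["limit"])


stage, size_kib, limit_kib = parse_task_size_warning(WARNING)
ratio = size_kib / limit_kib
print(f"stage {stage}: {size_kib} KiB is {ratio:.2f}x the recommended {limit_kib} KiB")
```

      For the line quoted here, the task is a little under three times the recommended size. The sketch only quantifies the warning; it does not establish whether the warning is related to the connection timeouts that follow.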

      2021-10-02T16:08:19.151ZGMT ERROR task-result-getter-0 RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks
      java.io.IOException: Failed to connect to /192.168.7.99:34259
      at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
      at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
      at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
      at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
      at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121)
      at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:143)
      at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:103)
      at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1010)
      at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:954)
      at scala.Option.orElse(Option.scala:447)
      at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:954)
      at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1092)
      at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:88)
      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
      at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /192.168.7.99:34259
      Caused by: java.net.ConnectException: Connection timed out
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
      at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
      at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
      at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
      at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
      at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
      at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
      at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
      at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
      at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      at java.lang.Thread.run(Thread.java:748)
      2021-10-02T16:08:19.151ZGMT ERROR task-result-getter-2 RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks
      java.io.IOException: Failed to connect to /192.168.6.167:42405
      at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
      at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
      at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
      at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
      at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121)
      at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:143)
      at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:103)
      at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1010)
      at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:954)
      at scala.Option.orElse(Option.scala:447)
      at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:954)
      at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1092)
      at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:88)
      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
      at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /192.168.6.167:42
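      The traces above show driver-side `task-result-getter` threads failing inside `BlockManager.getRemoteBytes` while connecting to two different addresses. As a rough aid for reading longer logs of this kind, the sketch below groups the unreachable endpoints by the thread that reported them. The excerpt is trimmed to the header and `Failed to connect` lines from this report; the helper name `failed_endpoints` and both regular expressions are hypothetical, and the reading of these addresses as executor pod IPs is an assumption that the report does not confirm.

```python
import re
from collections import defaultdict

# Header and "Failed to connect" lines copied from the report; stack frames omitted.
LOG_EXCERPT = """\
2021-10-02T16:08:19.151ZGMT ERROR task-result-getter-0 RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.7.99:34259
2021-10-02T16:08:19.151ZGMT ERROR task-result-getter-2 RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.6.167:42405
"""

# Hypothetical patterns, written only against the log shape shown in this report.
ERROR_RE = re.compile(r"^\S+ ERROR (?P<thread>\S+) RetryingBlockFetcher - ")
CONNECT_RE = re.compile(r"Failed to connect to /(?P<host>[\d.]+):(?P<port>\d+)")


def failed_endpoints(log_text):
    """Map each (host, port) that could not be reached to the reporting threads."""
    endpoints = defaultdict(list)
    thread = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        header = ERROR_RE.match(line)
        if header:
            # Remember which thread logged the error that follows.
            thread = header["thread"]
            continue
        failure = CONNECT_RE.search(line)
        if failure and thread is not None:
            key = (failure["host"], int(failure["port"]))
            endpoints[key].append(thread)
    return dict(endpoints)


for (host, port), threads in failed_endpoints(LOG_EXCERPT).items():
    print(f"{host}:{port} unreachable, reported by {', '.join(threads)}")
```

      If the same address keeps recurring across many errors, that would point at one unreachable peer; failures spread over several addresses, as in this excerpt, would be more consistent with a general connectivity problem between the driver and the other pods. Either reading is a hypothesis to check against the cluster's network setup, not a diagnosis.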

      Attachments

        1. Stage-result.pdf
          235 kB
          Spark101
        2. Storage-result.pdf
          133 kB
          Spark101
        3. executors.pdf
          194 kB
          Spark101
        4. thread-dump-exec3.pdf
          283 kB
          Spark101
        5. threadDump-exc2.pdf
          283 kB
          Spark101
        6. environment.pdf
          1.61 MB
          Spark101

          People

            Assignee: Unassigned
            Reporter: Spark101 (recSparkUser)
            Votes: 0
            Watchers: 1
