Hadoop Common / HADOOP-17201

Spark job with s3acommitter stuck at the last stage

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.1
    • Fix Version/s: None
    • Component/s: fs/s3
    • Environment: Spark 2.4.5 / Hadoop 3.2.1 with the S3A committer:
      spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
      spark.hadoop.fs.s3a.committer.name: magic
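
      The committer settings above live in the Spark configuration; a minimal spark-defaults.conf sketch with just the values from this report (the spark.hadoop. prefix passes the properties through to the underlying Hadoop Configuration):

      ```
      spark.hadoop.fs.s3a.committer.magic.enabled  true
      spark.hadoop.fs.s3a.committer.name           magic
      ```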

    Description

      Our Spark job usually takes one to two hours to finish. Occasionally it runs for more than three hours, at which point we know it is stuck; the stuck executor usually has a stack like this:

      "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 tid=0x00007f73e0005000 nid=0x12d waiting on condition [0x00007f74cb291000]
      java.lang.Thread.State: TIMED_WAITING (sleeping)
      at java.lang.Thread.sleep(Native Method)
      at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
      at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
      at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
      at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
      at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
      at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
      at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
      at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
      at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
      at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
      at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
      at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
      at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
      at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
      at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
      at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
      at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
      at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
      at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
      at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
      at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
      at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      at org.apache.spark.scheduler.Task.run(Task.scala:123)
      at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)

      Locked ownable synchronizers:

      • <0x00000003a57332e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

      Captured jstack output on the stuck executors is attached in case it's useful.
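
      The stack shows the worker parked in Thread.sleep() inside Invoker.retryUntranslated, i.e. sleeping between retries of the S3 DeleteObjects call issued from deleteUnnecessaryFakeDirectories. A minimal, hypothetical Java sketch of that retry-with-sleep pattern (a deliberate simplification, not the actual Invoker code) shows why the thread state is TIMED_WAITING:

      ```java
      import java.io.IOException;
      import java.util.concurrent.Callable;

      /**
       * Hypothetical simplification of o.a.h.fs.s3a.Invoker.retryUntranslated:
       * the executor thread shows TIMED_WAITING because it is inside
       * Thread.sleep() between attempts at the failing S3 call.
       */
      public class RetrySleepSketch {

          /** Retry op up to limit extra times, sleeping between attempts. */
          static <T> T retryUntranslated(int limit, long intervalMillis, Callable<T> op)
                  throws Exception {
              Exception last = null;
              for (int attempt = 0; attempt <= limit; attempt++) {
                  try {
                      return op.call();
                  } catch (IOException e) {
                      last = e;
                      Thread.sleep(intervalMillis);  // <-- where the jstack shows the thread parked
                  }
              }
              throw last;
          }

          public static void main(String[] args) throws Exception {
              final int[] calls = {0};
              // simulated op: succeeds on the third attempt, after two short sleeps
              String result = retryUntranslated(7, 10, () -> {
                  if (++calls[0] < 3) {
                      throw new IOException("503 Slow Down");
                  }
                  return "deleted";
              });
              System.out.println(result + " after " + calls[0] + " calls");
          }
      }
      ```

      A long-lived 503/throttle condition keeps the loop in the sleep branch for the full retry budget, which is what the jobs above experience.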

      Attachments

        1. exec-5.log
          93 kB
          Dyno Fu
        2. exec-7.log
          95 kB
          Dyno Fu
        3. exec-31.log
          89 kB
          Dyno Fu
        4. exec-25.log
          99 kB
          Dyno Fu
        5. exec-36.log
          136 kB
          Dyno Fu
        6. exec-44.log
          97 kB
          Dyno Fu
        7. exec-120.log
          119 kB
          Dyno Fu
        8. exec-64.log
          91 kB
          Dyno Fu
        9. exec-125.log
          148 kB
          Dyno Fu


          Activity

            githubbot ASF GitHub Bot logged work - 22/Oct/20 15:53
            • Time Spent:
              10m
               
              steveloughran opened a new pull request #2402:
              URL: https://github.com/apache/hadoop/pull/2402


                 
                 Make putObject & putObjectDirect retrying everywhere, update the @RetryPolicy
                 annotation, and make sure callers are not attempting to retry the methods.
                 
                 The test failure is related to the inconsistent S3 client creating too many throttle events for the retrier to handle... need to look at my settings to see if I've turned off the AWS SDK retries there.
                 
                 ```
                 [ERROR] Tests run: 17, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 60.785 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.ITestCommitOperations
                 [ERROR] testCommitEmptyFile(org.apache.hadoop.fs.s3a.commit.ITestCommitOperations) Time elapsed: 3.289 s <<< ERROR!
                 com.amazonaws.AmazonServiceException: throttled count = 1 (Service: null; Status Code: 503; Error Code: null; Request ID: null)
                  at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.maybeFail(InconsistentAmazonS3Client.java:571)
                  at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.maybeFail(InconsistentAmazonS3Client.java:586)
                  at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.putObject(InconsistentAmazonS3Client.java:226)
                  at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInOneChunk(UploadCallable.java:131)
                  at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:123)
                  at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:143)
                  at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:48)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
                 ```
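
                 The "make sure callers are not attempting to retry the methods" point matters because nesting retries multiplies the worst case. A hypothetical sketch (simplified counting loop, not the real Invoker) makes the arithmetic concrete:

                 ```java
                 import java.util.concurrent.Callable;

                 /**
                  * Hypothetical sketch: a caller that retries a method which already
                  * retries internally multiplies the worst-case number of attempts --
                  * and, in the real code, the sleeps between them.
                  */
                 public class NestedRetry {

                     /** Retry op up to limit extra times, rethrowing the last failure. */
                     static <T> T retry(int limit, Callable<T> op) throws Exception {
                         Exception last = null;
                         for (int attempt = 0; attempt <= limit; attempt++) {
                             try {
                                 return op.call();
                             } catch (Exception e) {
                                 last = e;  // the real Invoker sleeps here before retrying
                             }
                         }
                         throw last;
                     }

                     public static void main(String[] args) {
                         final int[] attempts = {0};
                         try {
                             // outer retry wraps an operation that retries internally
                             retry(3, () -> retry(3, () -> {
                                 attempts[0]++;
                                 throw new RuntimeException("503 Slow Down");
                             }));
                         } catch (Exception expected) {
                             // always-failing op: (3+1) outer x (3+1) inner = 16 calls
                         }
                         System.out.println("attempts = " + attempts[0]);
                     }
                 }
                 ```

                 With an always-failing operation and a retry limit of 3 at both levels, 16 service calls are made instead of 4; with real backoff sleeps between each, that compounding is how a job can appear stuck for hours.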


              ----------------------------------------------------------------
              This is an automated message from the Apache Git Service.
              To respond to the message, please log on to GitHub and use the
              URL above to go to the specific comment.

              For queries about this service, please contact Infrastructure at:
              users@infra.apache.org
            githubbot ASF GitHub Bot logged work - 22/Oct/20 17:13
            • Time Spent:
              10m
               
              hadoop-yetus commented on pull request #2402:
              URL: https://github.com/apache/hadoop/pull/2402#issuecomment-714636636


                 :broken_heart: **-1 overall**
                 
                 
                 
                 
                 
                 
                 | Vote | Subsystem | Runtime | Logfile | Comment |
                 |:----:|----------:|--------:|:--------:|:-------:|
                 | +0 :ok: | reexec | 0m 36s | | Docker mode activated. |
                 |||| _ Prechecks _ |
                 | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
                 | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
                 | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
                 |||| _ trunk Compile Tests _ |
                 | +1 :green_heart: | mvninstall | 31m 43s | | trunk passed |
                 | +1 :green_heart: | compile | 0m 44s | | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
                 | +1 :green_heart: | compile | 0m 42s | | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | +1 :green_heart: | checkstyle | 0m 31s | | trunk passed |
                 | +1 :green_heart: | mvnsite | 0m 46s | | trunk passed |
                 | +1 :green_heart: | shadedclient | 16m 37s | | branch has no errors when building and testing our client artifacts. |
                 | +1 :green_heart: | javadoc | 0m 24s | | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
                 | +1 :green_heart: | javadoc | 0m 33s | | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | +0 :ok: | spotbugs | 1m 13s | | Used deprecated FindBugs config; considering switching to SpotBugs. |
                 | +1 :green_heart: | findbugs | 1m 9s | | trunk passed |
                 |||| _ Patch Compile Tests _ |
                 | +1 :green_heart: | mvninstall | 0m 39s | | the patch passed |
                 | +1 :green_heart: | compile | 0m 36s | | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
                 | +1 :green_heart: | javac | 0m 36s | | the patch passed |
                 | +1 :green_heart: | compile | 0m 29s | | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | +1 :green_heart: | javac | 0m 29s | | the patch passed |
                 | +1 :green_heart: | checkstyle | 0m 20s | | the patch passed |
                 | +1 :green_heart: | mvnsite | 0m 35s | | the patch passed |
                 | +1 :green_heart: | whitespace | 0m 0s | | The patch has no whitespace issues. |
                 | +1 :green_heart: | shadedclient | 15m 10s | | patch has no errors when building and testing our client artifacts. |
                 | +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
                 | +1 :green_heart: | javadoc | 0m 28s | | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | +1 :green_heart: | findbugs | 1m 12s | | the patch passed |
                 |||| _ Other Tests _ |
                 | +1 :green_heart: | unit | 1m 15s | | hadoop-aws in the patch passed. |
                 | +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
                 | | | 77m 24s | | |
                 
                 
                 | Subsystem | Report/Notes |
                 |----------:|:-------------|
                 | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/artifact/out/Dockerfile |
                 | GITHUB PR | https://github.com/apache/hadoop/pull/2402 |
                 | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
                 | uname | Linux b615f35ec1fe 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
                 | Build tool | maven |
                 | Personality | dev-support/bin/hadoop.sh |
                 | git revision | trunk / 7435604a91a |
                 | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/testReport/ |
                 | Max. process+thread count | 458 (vs. ulimit of 5500) |
                 | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
                 | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/console |
                 | versions | git=2.17.1 maven=3.6.0 findbugs=4.1.3 |
                 | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |
                 
                 
                 This message was automatically generated.
                 
                 


            githubbot ASF GitHub Bot logged work - 17/Oct/21 00:31

            People

              Assignee: Unassigned
              Reporter: Dyno Fu
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h