Hadoop Common / HADOOP-17201

Spark job with s3acommitter stuck at the last stage

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.1
    • Fix Version/s: None
    • Component/s: fs/s3
    • Environment: Spark 2.4.5 / Hadoop 3.2.1 with the S3A committer:
      spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
      spark.hadoop.fs.s3a.committer.name: magic
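
      The committer settings above live in the Spark configuration; a minimal spark-defaults.conf sketch with just the values from this report (the spark.hadoop. prefix passes the properties through to the underlying Hadoop Configuration):

      ```
      spark.hadoop.fs.s3a.committer.magic.enabled  true
      spark.hadoop.fs.s3a.committer.name           magic
      ```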

    Description

      Our Spark job usually takes one to two hours to finish. Occasionally it runs for more than three hours, at which point we know it is stuck; the stuck executor usually has a stack like this:

      "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 tid=0x00007f73e0005000 nid=0x12d waiting on condition [0x00007f74cb291000]
      java.lang.Thread.State: TIMED_WAITING (sleeping)
      at java.lang.Thread.sleep(Native Method)
      at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
      at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
      at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
      at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
      at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
      at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
      at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
      at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
      at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
      at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
      at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
      at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
      at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
      at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
      at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
      at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
      at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
      at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
      at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
      at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
      at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
      at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      at org.apache.spark.scheduler.Task.run(Task.scala:123)
      at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)

      Locked ownable synchronizers:

      • <0x00000003a57332e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

      Captured jstack output on the stuck executors is attached in case it's useful.
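
      The stack shows the worker parked in Thread.sleep() inside Invoker.retryUntranslated, i.e. sleeping between retries of the S3 DeleteObjects call issued from deleteUnnecessaryFakeDirectories. A minimal, hypothetical Java sketch of that retry-with-sleep pattern (a deliberate simplification, not the actual Invoker code) shows why the thread state is TIMED_WAITING:

      ```java
      import java.io.IOException;
      import java.util.concurrent.Callable;

      /**
       * Hypothetical simplification of o.a.h.fs.s3a.Invoker.retryUntranslated:
       * the executor thread shows TIMED_WAITING because it is inside
       * Thread.sleep() between attempts at the failing S3 call.
       */
      public class RetrySleepSketch {

          /** Retry op up to limit extra times, sleeping between attempts. */
          static <T> T retryUntranslated(int limit, long intervalMillis, Callable<T> op)
                  throws Exception {
              Exception last = null;
              for (int attempt = 0; attempt <= limit; attempt++) {
                  try {
                      return op.call();
                  } catch (IOException e) {
                      last = e;
                      Thread.sleep(intervalMillis);  // <-- where the jstack shows the thread parked
                  }
              }
              throw last;
          }

          public static void main(String[] args) throws Exception {
              final int[] calls = {0};
              // simulated op: succeeds on the third attempt, after two short sleeps
              String result = retryUntranslated(7, 10, () -> {
                  if (++calls[0] < 3) {
                      throw new IOException("503 Slow Down");
                  }
                  return "deleted";
              });
              System.out.println(result + " after " + calls[0] + " calls");
          }
      }
      ```

      A long-lived 503/throttle condition keeps the loop in the sleep branch for the full retry budget, which is what the jobs above experience.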

      Attachments

        1. exec-5.log
          93 kB
          Dyno Fu
        2. exec-7.log
          95 kB
          Dyno Fu
        3. exec-31.log
          89 kB
          Dyno Fu
        4. exec-25.log
          99 kB
          Dyno Fu
        5. exec-36.log
          136 kB
          Dyno Fu
        6. exec-44.log
          97 kB
          Dyno Fu
        7. exec-120.log
          119 kB
          Dyno Fu
        8. exec-64.log
          91 kB
          Dyno Fu
        9. exec-125.log
          148 kB
          Dyno Fu


          Activity

            githubbot ASF GitHub Bot logged work - 22/Oct/20 15:53
            • Time Spent:
              10m
               
              steveloughran opened a new pull request #2402:
              URL: https://github.com/apache/hadoop/pull/2402


                 
                 Make putObject & putObjectDirect retrying everywhere, update the @RetryPolicy
                 annotation, and make sure callers are not attempting to retry the methods.
                 
                 The test failure is related to the inconsistent S3 client creating too many throttle events for the retrier to handle... need to look at my settings to see if I've turned off the AWS SDK retries there.
                 
                 ```
                 [ERROR] Tests run: 17, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 60.785 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.ITestCommitOperations
                 [ERROR] testCommitEmptyFile(org.apache.hadoop.fs.s3a.commit.ITestCommitOperations) Time elapsed: 3.289 s <<< ERROR!
                 com.amazonaws.AmazonServiceException: throttled count = 1 (Service: null; Status Code: 503; Error Code: null; Request ID: null)
                  at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.maybeFail(InconsistentAmazonS3Client.java:571)
                  at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.maybeFail(InconsistentAmazonS3Client.java:586)
                  at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.putObject(InconsistentAmazonS3Client.java:226)
                  at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInOneChunk(UploadCallable.java:131)
                  at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:123)
                  at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:143)
                  at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:48)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
                 ```
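
                 The "make sure callers are not attempting to retry the methods" point matters because nesting retries multiplies the worst case. A hypothetical sketch (simplified counting loop, not the real Invoker) makes the arithmetic concrete:

                 ```java
                 import java.util.concurrent.Callable;

                 /**
                  * Hypothetical sketch: a caller that retries a method which already
                  * retries internally multiplies the worst-case number of attempts --
                  * and, in the real code, the sleeps between them.
                  */
                 public class NestedRetry {

                     /** Retry op up to limit extra times, rethrowing the last failure. */
                     static <T> T retry(int limit, Callable<T> op) throws Exception {
                         Exception last = null;
                         for (int attempt = 0; attempt <= limit; attempt++) {
                             try {
                                 return op.call();
                             } catch (Exception e) {
                                 last = e;  // the real Invoker sleeps here before retrying
                             }
                         }
                         throw last;
                     }

                     public static void main(String[] args) {
                         final int[] attempts = {0};
                         try {
                             // outer retry wraps an operation that retries internally
                             retry(3, () -> retry(3, () -> {
                                 attempts[0]++;
                                 throw new RuntimeException("503 Slow Down");
                             }));
                         } catch (Exception expected) {
                             // always-failing op: (3+1) outer x (3+1) inner = 16 calls
                         }
                         System.out.println("attempts = " + attempts[0]);
                     }
                 }
                 ```

                 With an always-failing operation and a retry limit of 3 at both levels, 16 service calls are made instead of 4; with real backoff sleeps between each, that compounding is how a job can appear stuck for hours.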


              ----------------------------------------------------------------
              This is an automated message from the Apache Git Service.
              To respond to the message, please log on to GitHub and use the
              URL above to go to the specific comment.

              For queries about this service, please contact Infrastructure at:
              users@infra.apache.org
            githubbot ASF GitHub Bot logged work - 22/Oct/20 17:13
            • Time Spent:
              10m
               
              hadoop-yetus commented on pull request #2402:
              URL: https://github.com/apache/hadoop/pull/2402#issuecomment-714636636


                 :broken_heart: **-1 overall**
                 
                 
                 
                 
                 
                 
                 | Vote | Subsystem | Runtime | Logfile | Comment |
                 |:----:|----------:|--------:|:--------:|:-------:|
                 | +0 :ok: | reexec | 0m 36s | | Docker mode activated. |
                 |||| _ Prechecks _ |
                 | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
                 | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
                 | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
                 |||| _ trunk Compile Tests _ |
                 | +1 :green_heart: | mvninstall | 31m 43s | | trunk passed |
                 | +1 :green_heart: | compile | 0m 44s | | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
                 | +1 :green_heart: | compile | 0m 42s | | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | +1 :green_heart: | checkstyle | 0m 31s | | trunk passed |
                 | +1 :green_heart: | mvnsite | 0m 46s | | trunk passed |
                 | +1 :green_heart: | shadedclient | 16m 37s | | branch has no errors when building and testing our client artifacts. |
                 | +1 :green_heart: | javadoc | 0m 24s | | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
                 | +1 :green_heart: | javadoc | 0m 33s | | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | +0 :ok: | spotbugs | 1m 13s | | Used deprecated FindBugs config; considering switching to SpotBugs. |
                 | +1 :green_heart: | findbugs | 1m 9s | | trunk passed |
                 |||| _ Patch Compile Tests _ |
                 | +1 :green_heart: | mvninstall | 0m 39s | | the patch passed |
                 | +1 :green_heart: | compile | 0m 36s | | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
                 | +1 :green_heart: | javac | 0m 36s | | the patch passed |
                 | +1 :green_heart: | compile | 0m 29s | | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | +1 :green_heart: | javac | 0m 29s | | the patch passed |
                 | +1 :green_heart: | checkstyle | 0m 20s | | the patch passed |
                 | +1 :green_heart: | mvnsite | 0m 35s | | the patch passed |
                 | +1 :green_heart: | whitespace | 0m 0s | | The patch has no whitespace issues. |
                 | +1 :green_heart: | shadedclient | 15m 10s | | patch has no errors when building and testing our client artifacts. |
                 | +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
                 | +1 :green_heart: | javadoc | 0m 28s | | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | +1 :green_heart: | findbugs | 1m 12s | | the patch passed |
                 |||| _ Other Tests _ |
                 | +1 :green_heart: | unit | 1m 15s | | hadoop-aws in the patch passed. |
                 | +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
                 | | | 77m 24s | | |
                 
                 
                 | Subsystem | Report/Notes |
                 |----------:|:-------------|
                 | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/artifact/out/Dockerfile |
                 | GITHUB PR | https://github.com/apache/hadoop/pull/2402 |
                 | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
                 | uname | Linux b615f35ec1fe 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
                 | Build tool | maven |
                 | Personality | dev-support/bin/hadoop.sh |
                 | git revision | trunk / 7435604a91a |
                 | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
                 | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/testReport/ |
                 | Max. process+thread count | 458 (vs. ulimit of 5500) |
                 | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
                 | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/console |
                 | versions | git=2.17.1 maven=3.6.0 findbugs=4.1.3 |
                 | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |
                 
                 
                 This message was automatically generated.
                 
                 


            githubbot ASF GitHub Bot logged work - 17/Oct/21 00:31

            People

              Assignee: Unassigned
              Reporter: Dyno Fu
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h