Details

- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.2.1
- Fix Version/s: None
- Environment: Spark 2.4.5 / Hadoop 3.2.1 with the S3A committer:

  spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
  spark.hadoop.fs.s3a.committer.name: magic
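For reference, a minimal sketch of applying the same committer settings programmatically (same keys as the environment above; the app name is hypothetical):

```
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class MagicCommitterConf {
    public static void main(String[] args) {
        // Mirror of the spark-submit/YAML settings quoted above:
        // route s3a:// writes through the magic committer.
        SparkConf conf = new SparkConf()
            .set("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
            .set("spark.hadoop.fs.s3a.committer.name", "magic");

        SparkSession spark = SparkSession.builder()
            .appName("magic-committer-job") // hypothetical name
            .config(conf)
            .getOrCreate();
        spark.stop();
    }
}
```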
Description
Usually our Spark job takes an hour or two to finish. Occasionally it runs for more than three hours, at which point we know it is stuck, and the stuck executor usually has a stack like this:
```
"Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 tid=0x00007f73e0005000 nid=0x12d waiting on condition [0x00007f74cb291000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x00000003a57332e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
```
I captured jstack output on the stuck executors in case it's useful.
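For context on the trace: the sleep at the top is the S3A retry handler backing off between attempts at the deleteObjects() call under deleteUnnecessaryFakeDirectories(). A self-contained sketch of that retry-with-backoff shape (illustrative only, not the actual Invoker code; the bounds and backoff policy are made up):

```
import java.io.IOException;
import java.util.concurrent.Callable;

// Illustrative bounded retry-with-backoff, similar in shape to what
// S3A's Invoker.retryUntranslated() does around deleteObjects().
public final class RetrySketch {
    // Assumes limit >= 1.
    static <T> T retryWithBackoff(Callable<T> op, int limit, long intervalMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= limit; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                // This sleep is where the stuck executor thread sits
                // (Thread.State: TIMED_WAITING in the jstack above).
                Thread.sleep(intervalMs * attempt); // linear backoff
            }
        }
        throw last; // give up once the retry limit is exhausted
    }
}
```

If throttling is what triggers the retries, the relevant S3A tuning knobs include fs.s3a.retry.limit / fs.s3a.retry.interval and the fs.s3a.retry.throttle.* variants.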
Issue Links

- relates to: HADOOP-17063 S3A deleteObjects hanging/retrying forever (Open)
Activity

- Time Spent: 10m
steveloughran opened a new pull request #2402:
URL: https://github.com/apache/hadoop/pull/2402
make putObject & putObjectDirect retrying everywhere, update the @RetryPolicy annotation, and make sure callers are not attempting to retry these methods
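Roughly, the intent (a hypothetical wrapper for illustration; the real patch routes the existing putObject paths through S3A's Invoker and marks them idempotent):

```
import java.io.IOException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.PutObjectResult;
import org.apache.hadoop.fs.s3a.Invoker;

// Sketch only: wrap the raw PUT in the shared retry handler and mark
// it idempotent, so a repeated PUT of the same object is safe.
class PutRetrySketch {
    static PutObjectResult putWithRetry(Invoker invoker, AmazonS3 s3,
            PutObjectRequest req) throws IOException {
        return invoker.retry("PUT " + req.getKey(), req.getKey(),
            true /* idempotent */, () -> s3.putObject(req));
    }
}
```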
Test failure related to the inconsistent S3 client creating too many throttle events for the retrier to handle... need to look at my settings to see if I've turned off the AWS SDK retries there:
```
[ERROR] Tests run: 17, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 60.785 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.ITestCommitOperations
[ERROR] testCommitEmptyFile(org.apache.hadoop.fs.s3a.commit.ITestCommitOperations) Time elapsed: 3.289 s <<< ERROR!
com.amazonaws.AmazonServiceException: throttled count = 1 (Service: null; Status Code: 503; Error Code: null; Request ID: null)
at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.maybeFail(InconsistentAmazonS3Client.java:571)
at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.maybeFail(InconsistentAmazonS3Client.java:586)
at org.apache.hadoop.fs.s3a.InconsistentAmazonS3Client.putObject(InconsistentAmazonS3Client.java:226)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInOneChunk(UploadCallable.java:131)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:123)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:143)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:48)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
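For anyone trying to reproduce this failure mode: the 503s above come from the fault-injecting test client (InconsistentAmazonS3Client), not from S3 itself. A minimal sketch of the wiring, assuming the failinject key names from the s3a test support (verify against your Hadoop version):

```
import org.apache.hadoop.conf.Configuration;

// Sketch: enable the fault-injecting S3 client used by these ITests
// and tune the injected-throttle rate down so the retrier can absorb it.
class ThrottleInjectionSketch {
    static Configuration withInjectedThrottling() {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.s3.client.factory.impl",
            "org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory");
        // Probability of an injected 503 per request (0.0-1.0); assumed key name.
        conf.setFloat("fs.s3a.failinject.throttle.probability", 0.05f);
        return conf;
    }
}
```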
- Time Spent: 10m
hadoop-yetus commented on pull request #2402:
URL: https://github.com/apache/hadoop/pull/2402#issuecomment-714636636
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 36s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 31m 43s | | trunk passed |
| +1 :green_heart: | compile | 0m 44s | | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
| +1 :green_heart: | compile | 0m 42s | | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
| +1 :green_heart: | checkstyle | 0m 31s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 46s | | trunk passed |
| +1 :green_heart: | shadedclient | 16m 37s | | branch has no errors when building and testing our client artifacts. |
| +1 :green_heart: | javadoc | 0m 24s | | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
| +1 :green_heart: | javadoc | 0m 33s | | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
| +0 :ok: | spotbugs | 1m 13s | | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 :green_heart: | findbugs | 1m 9s | | trunk passed |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 39s | | the patch passed |
| +1 :green_heart: | compile | 0m 36s | | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
| +1 :green_heart: | javac | 0m 36s | | the patch passed |
| +1 :green_heart: | compile | 0m 29s | | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
| +1 :green_heart: | javac | 0m 29s | | the patch passed |
| +1 :green_heart: | checkstyle | 0m 20s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 35s | | the patch passed |
| +1 :green_heart: | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 :green_heart: | shadedclient | 15m 10s | | patch has no errors when building and testing our client artifacts. |
| +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 |
| +1 :green_heart: | javadoc | 0m 28s | | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
| +1 :green_heart: | findbugs | 1m 12s | | the patch passed |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 1m 15s | | hadoop-aws in the patch passed. |
| +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
| | | 77m 24s | | |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/2402 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux b615f35ec1fe 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 7435604a91a |
| Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/testReport/ |
| Max. process+thread count | 458 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2402/1/console |
| versions | git=2.17.1 maven=3.6.0 findbugs=4.1.3 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |
This message was automatically generated.
- Time Spent: 10m
steveloughran closed pull request #2402:
URL: https://github.com/apache/hadoop/pull/2402