Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.6
Fix Version/s: None
Component/s: None
Description
The app failed within 6 minutes, but the driver has been stuck for more than 8 hours. I would expect the driver to fail if the app fails.
Thread dump from jstack (on the driver pid) attached (j1.out).
Last part of the stdout driver log attached (the full log is 23 MB; the stderr log just contains the launch command).
Last part of the app logs attached.
I can see that the line "org.apache.spark.util.ShutdownHookManager - Shutdown hook called" never appears in the driver log after "org.apache.spark.SparkContext - Successfully stopped SparkContext".
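For reference, the checks above amount to roughly the following (the driver pid and log file name are placeholders, not values from the attached logs):

jstack <driver-pid> > j1.out
grep "Successfully stopped SparkContext" driver-stdout.log    # present in the log
grep "Shutdown hook called" driver-stdout.log                 # never appears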
I am using Spark 2.4.6 in standalone mode; the app was submitted with spark-submit to the REST API (port 6066) in cluster mode. Other drivers/apps have worked fine with this setup; only this one is getting stuck. My cluster has one EC2 instance dedicated as the Spark master and one Spot EC2 instance dedicated as the Spark worker. They can auto-heal/Spot-terminate at any time. From checking the AWS logs, the worker was terminated at 01:53:38.
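The submission was roughly of this form (the master host, application class, and jar path below are placeholders, not the actual values):

./bin/spark-submit \
  --master spark://<master-host>:6066 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/app.jar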
I think you can replicate this by tearing down the worker machine while an app is running; you might have to try several times.
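A rough reproduction sketch (the instance id is a placeholder; any way of abruptly killing the worker while the app is running should do):

# while the application is running, terminate the Spot worker instance
aws ec2 terminate-instances --instance-ids <worker-instance-id>
# or kill the worker process directly on the worker host;
# repeat a few times if the driver does not get stuck on the first attempt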
This is similar to https://issues.apache.org/jira/browse/SPARK-24617, which I raised before.
Attachments
Issue Links
relates to: SPARK-24617 "Spark driver not requesting another executor once original executor exits due to 'lost worker'" (Resolved)