[SPARK-16925] Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.6.0, 2.0.0
Fix Version/s: 1.6.3, 2.0.1, 2.1.0
Component/s: Deploy
Labels: None
Target Version/s: 1.6.3, 2.0.1

Description

On a Spark standalone cluster that runs a single application, a task that repeatedly fails by causing the executor JVM to exit with a zero exit code may temporarily hang the Spark application.

For example, running

        sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }

on a cluster will cause all executors to die, but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster. This is caused by a bug in the standalone Master, where schedule() is only called on executor exit when the exit code is non-zero. I think that we should always call schedule(), even on a "clean" executor shutdown, since schedule() should always be safe to call.
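
To illustrate the difference between the two behaviours, here is a minimal toy model. It is not the standalone Master's actual code: ToyMaster, onExecutorExit, alwaysReschedule and liveAfterExit are hypothetical names invented for this sketch, and schedule() is reduced to "launch a replacement if no executor is running".

```scala
// Toy model only; every name here is hypothetical, not Spark's Master code.
object ExecutorExitSketch {

  final class ToyMaster(alwaysReschedule: Boolean) {
    var liveExecutors: Int = 1
    var scheduleCalls: Int = 0

    // Stand-in for schedule(): start a replacement if nothing is running.
    // Calling it when nothing needs to be done is harmless.
    def schedule(): Unit = {
      scheduleCalls += 1
      if (liveExecutors == 0) liveExecutors = 1
    }

    // Called when an executor JVM exits with the given exit code.
    def onExecutorExit(exitCode: Int): Unit = {
      liveExecutors -= 1
      val abnormalExit = exitCode != 0
      // Behaviour described in this issue: reschedule only on abnormal exit.
      // Proposed behaviour: reschedule on every exit.
      if (alwaysReschedule || abnormalExit) schedule()
    }
  }

  // Number of live executors after a single executor exits.
  def liveAfterExit(alwaysReschedule: Boolean, exitCode: Int): Int = {
    val master = new ToyMaster(alwaysReschedule)
    master.onExecutorExit(exitCode)
    master.liveExecutors
  }
}
```

In this model, an exit with code 0 leaves no live executors when rescheduling depends on the exit code, which mirrors the hang described above. With alwaysReschedule set to true, the same exit is followed by a replacement.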

Issue Links

links to

[Github] Pull Request #14510 (JoshRosen)

People

Assignee: Josh Rosen
Reporter: Josh Rosen
Votes: 0
Watchers: 2

Dates

Created: 05/Aug/16 21:04
Updated: 07/Aug/16 02:41
Resolved: 07/Aug/16 02:41