Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 2.0.0
- Labels: None
Description
Currently, when you turn on blacklisting with spark.scheduler.executorTaskBlacklistTime but you have fewer than spark.task.maxFailures executors, you can end up with a job "hung" after some task failures.
If some task fails repeatedly (say, due to an error in user code), the task will be blacklisted from the given executor. It will then try another executor, and fail there as well. However, after it has tried all available executors, the scheduler will simply stop trying to schedule the task anywhere. The job doesn't fail, nor does it succeed; it simply waits. Eventually, when the blacklist expires, the task will be scheduled again, but that can be quite far in the future, and in the meantime the user just observes a stuck job.
Instead we should abort the stage (and fail any dependent jobs) as soon as we detect tasks that cannot be scheduled.
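To make the failure mode concrete, here is a minimal reproduction sketch (not part of the original report). It assumes a local-cluster master with two executors, the default spark.task.maxFailures of 4, and a task that always throws; the master URL, memory settings, and blacklist timeout below are illustrative assumptions. Before the fix, the failing task gets blacklisted from both executors and the job hangs until the blacklist expires; with the fix, the stage is aborted instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BlacklistHangRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("blacklist-hang-repro")
      // Two executors with one core each: fewer executors than maxFailures (4).
      .setMaster("local-cluster[2,1,1024]")
      // Blacklist a failed task from that executor for 10 minutes.
      .set("spark.scheduler.executorTaskBlacklistTime", "600000")
      .set("spark.task.maxFailures", "4")
    val sc = new SparkContext(conf)

    try {
      // Every attempt of partition 0 throws, so the task fails on each
      // executor in turn and is blacklisted from both. Before this fix the
      // job neither fails nor succeeds; it just sits until the blacklist
      // expires.
      sc.parallelize(0 until 2, 2).map { i =>
        if (i == 0) throw new RuntimeException("deliberate failure in user code")
        i
      }.collect()
    } finally {
      sc.stop()
    }
  }
}
```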
Attachments
Issue Links
- breaks
  - SPARK-17304 TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark (Resolved)
- links to