Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.4.1
- Fix Version/s: None
Description
The following race condition exists in ExecutorPodsAllocator when running a Spark application with static allocation on Kubernetes with numExecutors >= 1:
- Driver requests an executor
- exec-1 gets created and registers with driver
- exec-1 is moved from newlyCreatedExecutors to schedulerKnownNewlyCreatedExecs
- exec-1 gets deleted very quickly (~1-30 sec) after registration
- ExecutorPodsWatchSnapshotSource fails to catch the creation of the pod (e.g. websocket connection was reset, k8s-apiserver was down, etc.)
- ExecutorPodsPollingSnapshotSource fails to catch the creation because it only runs every 30 seconds, and the executor is removed well before the next poll runs
- exec-1 is never removed from schedulerKnownNewlyCreatedExecs
- ExecutorPodsAllocator will never request a new executor because its slot remains occupied by exec-1, since schedulerKnownNewlyCreatedExecs is never cleared (see the sketch below)
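
A minimal sketch of the bookkeeping involved, with heavily simplified types (the real ExecutorPodsAllocator in Spark's Kubernetes resource manager is far more involved); the collection names mirror the fields mentioned above, everything else here is illustrative:

```scala
import scala.collection.mutable

object AllocatorRaceSketch {
  // execId -> time the pod creation was requested (simplified).
  val newlyCreatedExecutors = mutable.Map.empty[Long, Long]

  // Executors that registered with the driver before any pod snapshot
  // reported their pod. An entry is expected to be cleared once a watch
  // or polling snapshot finally shows the pod (or its deletion).
  val schedulerKnownNewlyCreatedExecs = mutable.Set.empty[Long]

  def onExecutorRegistered(execId: Long): Unit = {
    // Step 3 in the list above: the executor moves between the collections.
    newlyCreatedExecutors.remove(execId)
    schedulerKnownNewlyCreatedExecs += execId
  }

  def onSnapshot(podsSeen: Set[Long]): Unit = {
    // Normally a snapshot containing the pod clears the entry; in this bug
    // the watch missed the pod and it was gone before the 30s poll ran.
    schedulerKnownNewlyCreatedExecs --= podsSeen
  }

  // Slots counted as "already requested", blocking further allocation.
  def occupiedSlots: Int =
    newlyCreatedExecutors.size + schedulerKnownNewlyCreatedExecs.size

  def main(args: Array[String]): Unit = {
    newlyCreatedExecutors(1L) = System.currentTimeMillis() // pod requested
    onExecutorRegistered(1L)  // exec-1 registers with the driver
    onSnapshot(Set.empty)     // no snapshot ever contained exec-1's pod
    println(s"occupied slots: $occupiedSlots") // stays 1 forever; no replacement
  }
}
```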
Put up a fix here https://github.com/apache/spark/pull/42297
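
For context, one plausible shape for such a fix (not necessarily what the linked PR implements) is to expire entries in schedulerKnownNewlyCreatedExecs after a timeout, similar to the timeout already applied to newlyCreatedExecutors, so a slot held by a pod that never appeared in any snapshot is eventually reclaimed. All names below are hypothetical:

```scala
import scala.collection.mutable

object SchedulerKnownExecPruning {
  // Hypothetical helper: drop scheduler-known executors whose pods have not
  // shown up in any snapshot within `timeoutMs` of registration.
  def pruneStale(
      schedulerKnownNewlyCreatedExecs: mutable.Set[Long],
      registrationTimes: Map[Long, Long], // execId -> registration time (ms)
      nowMs: Long,
      timeoutMs: Long): Unit = {
    val stale = schedulerKnownNewlyCreatedExecs.filter { id =>
      registrationTimes.get(id).exists(t => nowMs - t > timeoutMs)
    }
    // Dropping the stale entry frees the slot so a replacement can be requested.
    schedulerKnownNewlyCreatedExecs --= stale
  }
}
```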