Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.4.1
- Fix Version/s: None
Description
The following race condition exists in ExecutorPodsAllocator when running a Spark application with static allocation on Kubernetes with numExecutors >= 1:
- Driver requests an executor
- exec-1 gets created and registers with driver
- exec-1 is moved from newlyCreatedExecutors to schedulerKnownNewlyCreatedExecs
- exec-1 gets deleted very quickly (~1-30 sec) after registration
- ExecutorPodsWatchSnapshotSource fails to catch the creation of the pod (e.g. websocket connection was reset, k8s-apiserver was down, etc.)
- ExecutorPodsPollingSnapshotSource fails to catch the creation because it only runs every 30 seconds, and the executor is removed well before the next poll runs
- exec-1 is never removed from schedulerKnownNewlyCreatedExecs
- ExecutorPodsAllocator will never request a new executor because its slot remains occupied by exec-1, since schedulerKnownNewlyCreatedExecs is never cleared (see the sketch below)
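
A minimal sketch of the bookkeeping involved, with heavily simplified types (the real ExecutorPodsAllocator in Spark's Kubernetes resource manager is far more involved); the collection names mirror the fields mentioned above, everything else here is illustrative:

```scala
import scala.collection.mutable

object AllocatorRaceSketch {
  // execId -> time the pod creation was requested (simplified).
  val newlyCreatedExecutors = mutable.Map.empty[Long, Long]

  // Executors that registered with the driver before any pod snapshot
  // reported their pod. An entry is expected to be cleared once a watch
  // or polling snapshot finally shows the pod (or its deletion).
  val schedulerKnownNewlyCreatedExecs = mutable.Set.empty[Long]

  def onExecutorRegistered(execId: Long): Unit = {
    // Step 3 in the list above: the executor moves between the collections.
    newlyCreatedExecutors.remove(execId)
    schedulerKnownNewlyCreatedExecs += execId
  }

  def onSnapshot(podsSeen: Set[Long]): Unit = {
    // Normally a snapshot containing the pod clears the entry; in this bug
    // the watch missed the pod and it was gone before the 30s poll ran.
    schedulerKnownNewlyCreatedExecs --= podsSeen
  }

  // Slots counted as "already requested", blocking further allocation.
  def occupiedSlots: Int =
    newlyCreatedExecutors.size + schedulerKnownNewlyCreatedExecs.size

  def main(args: Array[String]): Unit = {
    newlyCreatedExecutors(1L) = System.currentTimeMillis() // pod requested
    onExecutorRegistered(1L)  // exec-1 registers with the driver
    onSnapshot(Set.empty)     // no snapshot ever contained exec-1's pod
    println(s"occupied slots: $occupiedSlots") // stays 1 forever; no replacement
  }
}
```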
Put up a fix here https://github.com/apache/spark/pull/42297
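
For context, one plausible shape for such a fix (not necessarily what the linked PR implements) is to expire entries in schedulerKnownNewlyCreatedExecs after a timeout, similar to the timeout already applied to newlyCreatedExecutors, so a slot held by a pod that never appeared in any snapshot is eventually reclaimed. All names below are hypothetical:

```scala
import scala.collection.mutable

object SchedulerKnownExecPruning {
  // Hypothetical helper: drop scheduler-known executors whose pods have not
  // shown up in any snapshot within `timeoutMs` of registration.
  def pruneStale(
      schedulerKnownNewlyCreatedExecs: mutable.Set[Long],
      registrationTimes: Map[Long, Long], // execId -> registration time (ms)
      nowMs: Long,
      timeoutMs: Long): Unit = {
    val stale = schedulerKnownNewlyCreatedExecs.filter { id =>
      registrationTimes.get(id).exists(t => nowMs - t > timeoutMs)
    }
    // Dropping the stale entry frees the slot so a replacement can be requested.
    schedulerKnownNewlyCreatedExecs --= stale
  }
}
```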