Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.4.4
Fix Version/s: None
Environment:
EKS version: v1.14.6-eks-5047ed with 18 x r5a.12xlarge EC2 instances as Worker nodes
Spark version: 2.4.4
Kubernetes client version: 4.4.2
Description
We are running Spark on Kubernetes (EKS). For some Spark jobs, the driver pod keeps creating and terminating executor pods; each executor pod lives for only a few seconds. Some jobs have been stuck in this state for days and have spun up more than a thousand executor pods.
➜ kubectl get pods --field-selector=status.phase=Running -n spark
NAME                                                                READY   STATUS    RESTARTS   AGE
log-parser-k8s.structured-log.growth-referral-late-1570797184638    1/1     Running   0          111m
log-parser-k8s.structured-log.growth-referral-late-1570798665864    1/1     Running   0          87m
log-parser-k8s.structured-log.growth-referral-late-1570801912533    1/1     Running   0          33m
log-parser-k8s.structured-log.growth-referral-late-1570803823618    1/1     Running   0          77s
log-parser-k8s.structured-log.orbital-latest-struc-1570797051892    1/1     Running   0          114m
log-parser-k8s.structured-log.orbital-latest-struc-1570803823552    1/1     Running   0          77s
logparserstructured-1570797185983-exec-1727                         1/1     Running   0          4s
logparserstructured-1570801913787-exec-431                          1/1     Running   0          4s
logparserstructured-1570803824926-exec-1                            1/1     Running   0          69s
logparserstructured-1570803824926-exec-2                            1/1     Running   0          69s
Running kubectl get pods, we can see a bunch of executor pods being terminated:
logparserstructured-1570786196291-exec-1511   0/1     Terminating   0          7s
logparserstructured-1570786196291-exec-1513   0/1     Terminating   0          6s
logparserstructured-1570786196291-exec-1516   0/1     Terminating   0          5s
logparserstructured-1570786196291-exec-1518   0/1     Terminating   0          4s
logparserstructured-1570786196291-exec-1519   0/1     Terminating   0          3s
logparserstructured-1570786196291-exec-1520   0/1     Terminating   0          3s
logparserstructured-1570786196291-exec-1521   0/1     Terminating   0          2s
logparserstructured-1570786196291-exec-1522   0/1     Terminating   0          2s
logparserstructured-1570786196291-exec-1523   0/1     Terminating   0          1s
logparserstructured-1570786196291-exec-1524   0/1     Terminating   0          1s
logparserstructured-1570786469589-exec-1487   0/1     Terminating   0          7s
logparserstructured-1570786469589-exec-1489   0/1     Terminating   0          6s
logparserstructured-1570786469589-exec-1491   0/1     Terminating   0          5s
logparserstructured-1570786469589-exec-1493   0/1     Terminating   0          4s
logparserstructured-1570786469589-exec-1494   0/1     Terminating   0          3s
logparserstructured-1570786469589-exec-1495   0/1     Terminating   0          2s
logparserstructured-1570786469589-exec-1496   0/1     Terminating   0          2s
logparserstructured-1570786469589-exec-1497   0/1     Terminating   0          1s
logparserstructured-1570786469589-exec-1498   0/1     Terminating   0          1s
logparserstructured-1570786469589-exec-1499   0/1     Terminating   0          0s
logparserstructured-1570786469589-exec-1500   0/1     Terminating   0          0s
logparserstructured-1570787682660-exec-1293   0/1     Terminating   0          8s
logparserstructured-1570787682660-exec-1299   0/1     Terminating   0          5s
logparserstructured-1570787682660-exec-1300   0/1     Terminating   0          5s
logparserstructured-1570787682660-exec-1301   0/1     Terminating   0          4s
logparserstructured-1570787682660-exec-1303   0/1     Terminating   0          3s
logparserstructured-1570787682660-exec-1305   0/1     Terminating   0          2s
logparserstructured-1570787682660-exec-1306   0/1     Terminating   0          2s
logparserstructured-1570787682660-exec-1307   0/1     Terminating   0          1s
logparserstructured-1570787682660-exec-1308   0/1     Terminating   0          1s
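To quantify the churn per driver, the executor pods can be counted by name prefix with the fabric8 client (version 4.4.2, per the environment above). This is only an illustrative sketch, not part of the job: it assumes the spark namespace used here and the spark-role=executor label that Spark puts on the executor pods it creates.

import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

object CountExecutorPods {
  def main(args: Array[String]): Unit = {
    // Connects with the usual defaults (in-cluster config or local kubeconfig).
    val client = new DefaultKubernetesClient()
    try {
      // Spark labels the executor pods it creates with spark-role=executor.
      val executors = client.pods()
        .inNamespace("spark")
        .withLabel("spark-role", "executor")
        .list()
        .getItems.asScala
      // Group by the driver-derived prefix, e.g. logparserstructured-1570787682660.
      executors
        .groupBy(_.getMetadata.getName.replaceAll("-exec-\\d+$", ""))
        .foreach { case (prefix, pods) =>
          println(s"$prefix: ${pods.size} executor pods")
        }
    } finally {
      client.close()
    }
  }
}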
Checked the driver pod logs:
19/10/11 09:54:51 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/10/11 09:54:51 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://logparserstructured-1570787682660-driver-svc.spark.svc:4040
19/10/11 09:54:51 INFO SparkContext: Added JAR s3://datastore/source-code/log-analytics-assembly-2.4.0.jar at s3://datastore/source-code/log-analytics-assembly-2.4.0.jar with timestamp 1570787691192
19/10/11 09:54:52 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
19/10/11 09:54:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
19/10/11 09:54:52 INFO NettyBlockTransferService: Server created on logparserstructured-1570787682660-driver-svc.spark.svc:7079
19/10/11 09:54:52 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/10/11 09:54:52 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManagerMasterEndpoint: Registering block manager logparserstructured-1570787682660-driver-svc.spark.svc:7079 with 2004.6 MB RAM, BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark/pods. Message: pods "logparserstructured-1570787682660-exec-1" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=logparserstructured-1570787682660-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "logparserstructured-1570787682660-exec-1" already exists, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:503)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:442)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:406)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:234)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:796)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:326)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:322)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots$1.apply$mcVI$sp(ExecutorPodsAllocator.scala:139)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots(ExecutorPodsAllocator.scala:126)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:102)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:99)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$addSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:71)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anon$1.run(ExecutorPodsSnapshotsStoreImpl.scala:107)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
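The trace shows the pod create call failing inside ExecutorPodsAllocator.onNewSnapshots with a 409 AlreadyExists, i.e. the driver POSTs an executor pod whose name is already taken. Below is a minimal, self-contained sketch of that same failure mode against the fabric8 4.x API; the pod name, image, and object names here are placeholders for illustration, not taken from the job.

import io.fabric8.kubernetes.api.model.PodBuilder
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException}

object AlreadyExistsRepro {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    val pod = new PodBuilder()
      .withNewMetadata().withName("demo-exec-1").endMetadata()
      .withNewSpec()
        .addNewContainer().withName("main").withImage("busybox").endContainer()
      .endSpec()
      .build()
    try {
      client.pods().inNamespace("spark").create(pod)
      // A second create with the same name makes the API server answer
      // 409 AlreadyExists, which fabric8 raises as KubernetesClientException,
      // exactly as in the driver log above.
      client.pods().inNamespace("spark").create(pod)
    } catch {
      case e: KubernetesClientException if e.getCode == 409 =>
        println(s"AlreadyExists: ${e.getMessage}")
    } finally {
      client.close()
    }
  }
}

Later in the same driver log, the allocator keeps requesting single executors while the scheduler backend is repeatedly asked to remove executors it never registered: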
19/10/11 09:55:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
19/10/11 09:55:37 INFO BlockManagerMaster: Removal of executor 8 requested
19/10/11 09:55:37 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 8
19/10/11 09:55:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:55:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:56:00 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
19/10/11 09:56:02 INFO BlockManagerMaster: Removal of executor 9 requested
19/10/11 09:56:02 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 9
19/10/11 09:56:02 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:04 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/10/11 09:56:05 INFO BlockManagerMaster: Removal of executor 10 requested
19/10/11 09:56:05 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 10
19/10/11 09:56:06 INFO BlockManagerMaster: Removal of executor 11 requested
19/10/11 09:56:06 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
19/10/11 09:56:06 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 11
19/10/11 09:56:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:56:22 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:24 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 12 from BlockManagerMaster.
19/10/11 09:56:25 INFO BlockManagerMaster: Removal of executor 12 requested
19/10/11 09:56:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 12
19/10/11 09:56:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 13 from BlockManagerMaster.
19/10/11 09:56:26 INFO BlockManagerMaster: Removal of executor 13 requested
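As a stop-gap while a job is stuck in this loop, the leaked executor pods can be bulk-deleted by label. This is again only a hedged sketch against the fabric8 4.x API, assuming the spark-role=executor label and the spark namespace from above; narrow the selector (e.g. by spark-app-selector) before running it for real.

import io.fabric8.kubernetes.client.DefaultKubernetesClient

object DeleteLeakedExecutors {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    try {
      // Deletes every pod carrying Spark's executor label in the namespace.
      val deleted = client.pods()
        .inNamespace("spark")
        .withLabel("spark-role", "executor")
        .delete()
      println(s"delete issued: $deleted")
    } finally {
      client.close()
    }
  }
}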