Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.4.4
Fix Version/s: None
Environment:
EKS version: v1.14.6-eks-5047ed with 18 x r5a.12xlarge EC2 instances as Worker nodes
Spark version: 2.4.4
Kubernetes client version: 4.4.2
Description
We are running Spark on Kubernetes (EKS). For some Spark jobs, the driver pod keeps creating and terminating executor pods; each executor pod lives for only a few seconds. Some jobs have been stuck in this state for days and have spun up more than a thousand executor pods.
➜ kubectl get pods --field-selector=status.phase=Running -n spark
NAME                                                                READY   STATUS    RESTARTS   AGE
log-parser-k8s.structured-log.growth-referral-late-1570797184638    1/1     Running   0          111m
log-parser-k8s.structured-log.growth-referral-late-1570798665864    1/1     Running   0          87m
log-parser-k8s.structured-log.growth-referral-late-1570801912533    1/1     Running   0          33m
log-parser-k8s.structured-log.growth-referral-late-1570803823618    1/1     Running   0          77s
log-parser-k8s.structured-log.orbital-latest-struc-1570797051892    1/1     Running   0          114m
log-parser-k8s.structured-log.orbital-latest-struc-1570803823552    1/1     Running   0          77s
logparserstructured-1570797185983-exec-1727                         1/1     Running   0          4s
logparserstructured-1570801913787-exec-431                          1/1     Running   0          4s
logparserstructured-1570803824926-exec-1                            1/1     Running   0          69s
logparserstructured-1570803824926-exec-2                            1/1     Running   0          69s
Running kubectl get pods, we can see a bunch of executor pods being terminated:
logparserstructured-1570786196291-exec-1511   0/1     Terminating   0          7s
logparserstructured-1570786196291-exec-1513   0/1     Terminating   0          6s
logparserstructured-1570786196291-exec-1516   0/1     Terminating   0          5s
logparserstructured-1570786196291-exec-1518   0/1     Terminating   0          4s
logparserstructured-1570786196291-exec-1519   0/1     Terminating   0          3s
logparserstructured-1570786196291-exec-1520   0/1     Terminating   0          3s
logparserstructured-1570786196291-exec-1521   0/1     Terminating   0          2s
logparserstructured-1570786196291-exec-1522   0/1     Terminating   0          2s
logparserstructured-1570786196291-exec-1523   0/1     Terminating   0          1s
logparserstructured-1570786196291-exec-1524   0/1     Terminating   0          1s
logparserstructured-1570786469589-exec-1487   0/1     Terminating   0          7s
logparserstructured-1570786469589-exec-1489   0/1     Terminating   0          6s
logparserstructured-1570786469589-exec-1491   0/1     Terminating   0          5s
logparserstructured-1570786469589-exec-1493   0/1     Terminating   0          4s
logparserstructured-1570786469589-exec-1494   0/1     Terminating   0          3s
logparserstructured-1570786469589-exec-1495   0/1     Terminating   0          2s
logparserstructured-1570786469589-exec-1496   0/1     Terminating   0          2s
logparserstructured-1570786469589-exec-1497   0/1     Terminating   0          1s
logparserstructured-1570786469589-exec-1498   0/1     Terminating   0          1s
logparserstructured-1570786469589-exec-1499   0/1     Terminating   0          0s
logparserstructured-1570786469589-exec-1500   0/1     Terminating   0          0s
logparserstructured-1570787682660-exec-1293   0/1     Terminating   0          8s
logparserstructured-1570787682660-exec-1299   0/1     Terminating   0          5s
logparserstructured-1570787682660-exec-1300   0/1     Terminating   0          5s
logparserstructured-1570787682660-exec-1301   0/1     Terminating   0          4s
logparserstructured-1570787682660-exec-1303   0/1     Terminating   0          3s
logparserstructured-1570787682660-exec-1305   0/1     Terminating   0          2s
logparserstructured-1570787682660-exec-1306   0/1     Terminating   0          2s
logparserstructured-1570787682660-exec-1307   0/1     Terminating   0          1s
logparserstructured-1570787682660-exec-1308   0/1     Terminating   0          1s
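To quantify the churn per driver, the executor pods can be counted by name prefix with the fabric8 client (version 4.4.2, per the environment above). This is only an illustrative sketch, not part of the job: it assumes the spark namespace used here and the spark-role=executor label that Spark puts on the executor pods it creates.

import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

object CountExecutorPods {
  def main(args: Array[String]): Unit = {
    // Connects with the usual defaults (in-cluster config or local kubeconfig).
    val client = new DefaultKubernetesClient()
    try {
      // Spark labels the executor pods it creates with spark-role=executor.
      val executors = client.pods()
        .inNamespace("spark")
        .withLabel("spark-role", "executor")
        .list()
        .getItems.asScala
      // Group by the driver-derived prefix, e.g. logparserstructured-1570787682660.
      executors
        .groupBy(_.getMetadata.getName.replaceAll("-exec-\\d+$", ""))
        .foreach { case (prefix, pods) =>
          println(s"$prefix: ${pods.size} executor pods")
        }
    } finally {
      client.close()
    }
  }
}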
Checked the driver pod logs:
19/10/11 09:54:51 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/10/11 09:54:51 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://logparserstructured-1570787682660-driver-svc.spark.svc:4040
19/10/11 09:54:51 INFO SparkContext: Added JAR s3://datastore/source-code/log-analytics-assembly-2.4.0.jar at s3://datastore/source-code/log-analytics-assembly-2.4.0.jar with timestamp 1570787691192
19/10/11 09:54:52 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
19/10/11 09:54:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
19/10/11 09:54:52 INFO NettyBlockTransferService: Server created on logparserstructured-1570787682660-driver-svc.spark.svc:7079
19/10/11 09:54:52 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/10/11 09:54:52 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManagerMasterEndpoint: Registering block manager logparserstructured-1570787682660-driver-svc.spark.svc:7079 with 2004.6 MB RAM, BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
19/10/11 09:54:52 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark/pods. Message: pods "logparserstructured-1570787682660-exec-1" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=logparserstructured-1570787682660-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "logparserstructured-1570787682660-exec-1" already exists, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:503)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:442)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:406)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:234)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:796)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:326)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:322)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots$1.apply$mcVI$sp(ExecutorPodsAllocator.scala:139)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots(ExecutorPodsAllocator.scala:126)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:102)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:99)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$addSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:71)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anon$1.run(ExecutorPodsSnapshotsStoreImpl.scala:107)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
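The trace shows the pod create call failing inside ExecutorPodsAllocator.onNewSnapshots with a 409 AlreadyExists, i.e. the driver POSTs an executor pod whose name is already taken. Below is a minimal, self-contained sketch of that same failure mode against the fabric8 4.x API; the pod name, image, and object names here are placeholders for illustration, not taken from the job.

import io.fabric8.kubernetes.api.model.PodBuilder
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException}

object AlreadyExistsRepro {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    val pod = new PodBuilder()
      .withNewMetadata().withName("demo-exec-1").endMetadata()
      .withNewSpec()
        .addNewContainer().withName("main").withImage("busybox").endContainer()
      .endSpec()
      .build()
    try {
      client.pods().inNamespace("spark").create(pod)
      // A second create with the same name makes the API server answer
      // 409 AlreadyExists, which fabric8 raises as KubernetesClientException,
      // exactly as in the driver log above.
      client.pods().inNamespace("spark").create(pod)
    } catch {
      case e: KubernetesClientException if e.getCode == 409 =>
        println(s"AlreadyExists: ${e.getMessage}")
    } finally {
      client.close()
    }
  }
}

Later in the same driver log, the allocator keeps requesting single executors while the scheduler backend is repeatedly asked to remove executors it never registered: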
19/10/11 09:55:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
19/10/11 09:55:37 INFO BlockManagerMaster: Removal of executor 8 requested
19/10/11 09:55:37 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 8
19/10/11 09:55:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:55:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:56:00 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
19/10/11 09:56:02 INFO BlockManagerMaster: Removal of executor 9 requested
19/10/11 09:56:02 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 9
19/10/11 09:56:02 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:04 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/10/11 09:56:05 INFO BlockManagerMaster: Removal of executor 10 requested
19/10/11 09:56:05 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 10
19/10/11 09:56:06 INFO BlockManagerMaster: Removal of executor 11 requested
19/10/11 09:56:06 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
19/10/11 09:56:06 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 11
19/10/11 09:56:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/10/11 09:56:22 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:24 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/10/11 09:56:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 12 from BlockManagerMaster.
19/10/11 09:56:25 INFO BlockManagerMaster: Removal of executor 12 requested
19/10/11 09:56:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 12
19/10/11 09:56:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 13 from BlockManagerMaster.
19/10/11 09:56:26 INFO BlockManagerMaster: Removal of executor 13 requested
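As a stop-gap while a job is stuck in this loop, the leaked executor pods can be bulk-deleted by label. This is again only a hedged sketch against the fabric8 4.x API, assuming the spark-role=executor label and the spark namespace from above; narrow the selector (e.g. by spark-app-selector) before running it for real.

import io.fabric8.kubernetes.client.DefaultKubernetesClient

object DeleteLeakedExecutors {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    try {
      // Deletes every pod carrying Spark's executor label in the namespace.
      val deleted = client.pods()
        .inNamespace("spark")
        .withLabel("spark-role", "executor")
        .delete()
      println(s"delete issued: $deleted")
    } finally {
      client.close()
    }
  }
}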