Details
Bug
Status: Resolved
Major
Resolution: Incomplete
1.4.1
None
branch-1.4 #8dfdca46dd2f527bf653ea96777b23652bc4eb83
Description
Hello,
I have just started using start-mesos-dispatcher and have been noticing random crashes caused by NPEs.
Judging by the exception, it looks like in certain situations "queuedDrivers" is empty, which causes the NPE at "submission.cores".
log
15/07/30 23:56:44 INFO MesosRestServer: Started REST server for submitting applications on port 7077
Exception in thread "Thread-1647" java.lang.NullPointerException
    at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:437)
    at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:436)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:436)
    at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:512)
I0731 00:53:52.969518 7014 sched.cpp:1625] Asked to abort the driver
I0731 00:53:52.969895 7014 sched.cpp:861] Aborting framework '20150730-234528-4261456064-5050-61754-0000'
15/07/31 00:53:52 INFO MesosClusterScheduler: driver.run() returned with code DRIVER_ABORTED
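To illustrate the kind of guard that I think would avoid the crash, here is a minimal Scala sketch. It is not the actual MesosClusterScheduler code: `Submission`, `Offer`, and `NullSafeScheduling.schedule` are hypothetical stand-ins I made up for this example, and it assumes the NPE comes from dereferencing a null entry while iterating over the queued drivers.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a queued driver submission (not Spark's class).
final case class Submission(id: String, cores: Double, mem: Int)

// Hypothetical stand-in for the remaining resources of one offer.
final class Offer(var cpus: Double, var mem: Int)

object NullSafeScheduling {
  // Walks the queue and "launches" every submission that fits in the offer.
  // Null entries are filtered out first, so submission.cores is never
  // evaluated on a null reference.
  def schedule(queued: Seq[Submission], offer: Offer): Seq[String] = {
    val launched = ArrayBuffer.empty[String]
    queued.filter(_ != null).foreach { submission =>
      if (submission.cores <= offer.cpus && submission.mem <= offer.mem) {
        offer.cpus -= submission.cores
        offer.mem -= submission.mem
        launched += submission.id
      }
    }
    launched.toList
  }
}
```

With a guard like this, a bad entry in the queue would be skipped instead of aborting the whole framework, although it would still be worth logging the skipped entry to find out how it got there.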
A side effect of this NPE is that after the crash the dispatcher will not start again, because it is already registered (see #SPARK-7831).
log
15/07/31 09:55:47 INFO MesosClusterUI: Started MesosClusterUI at http://192.168.0.254:8081
I0731 09:55:47.715039 8162 sched.cpp:157] Version: 0.23.0
I0731 09:55:47.717013 8163 sched.cpp:254] New master detected at master@192.168.0.254:5050
I0731 09:55:47.717381 8163 sched.cpp:264] No credentials provided. Attempting to register without authentication
I0731 09:55:47.718246 8177 sched.cpp:819] Got error 'Completed framework attempted to re-register'
I0731 09:55:47.718268 8177 sched.cpp:1625] Asked to abort the driver
15/07/31 09:55:47 ERROR MesosClusterScheduler: Error received: Completed framework attempted to re-register
I0731 09:55:47.719091 8177 sched.cpp:861] Aborting framework '20150730-234528-4261456064-5050-61754-0038'
15/07/31 09:55:47 INFO MesosClusterScheduler: driver.run() returned with code DRIVER_ABORTED
15/07/31 09:55:47 INFO Utils: Shutdown hook called
I can work around this by removing the dispatcher's ZooKeeper data:
zkCli.sh
rmr /spark_mesos_dispatcher