[SPARK-17929] Deadlock when AM restart and send RemoveExecutor on reset - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.2, 2.1.0
Component/s: Spark Core
Labels: None

Description

The fix for SPARK-10582 added a reset method in CoarseGrainedSchedulerBackend.scala:

  protected def reset(): Unit = synchronized {
    numPendingExecutors = 0
    executorsPendingToRemove.clear()

    // Remove all the lingering executors that should be removed but not yet. The reason might be
    // because (1) disconnected event is not yet received; (2) executors die silently.
    executorDataMap.toMap.foreach { case (eid, _) =>
      driverEndpoint.askWithRetry[Boolean](
        RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
    }
  }

However, removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized". Because reset holds that lock while it waits for the reply to RemoveExecutor, this causes a deadlock: the RPC fails, and so does reset.

    private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
      logDebug(s"Asked to remove executor $executorId with reason $reason")
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          // This must be synchronized because variables mutated
          // in this block are read when requesting executors
          val killed = CoarseGrainedSchedulerBackend.this.synchronized {
            addressToExecutorId -= executorInfo.executorAddress
            executorDataMap -= executorId
            executorsPendingLossReason -= executorId
            executorsPendingToRemove.remove(executorId).getOrElse(false)
          }
     ...
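
To make the lock interaction easier to see, here is a minimal sketch that reproduces the same shape outside Spark. Everything in it is hypothetical and is not Spark code: BackendSketch, askRemove, resetHoldingLock and resetOutsideLock are illustrative names, a single-thread pool stands in for the driver endpoint, and a timeout is used so the blocked call returns instead of hanging. The second method shows one common way to avoid the pattern (snapshot under the lock, ask after releasing it); it is not claimed to be what the linked pull request does.

```scala
import java.util.concurrent.{Executors, TimeoutException}
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for the scheduler backend; not Spark code.
class BackendSketch(ids: Seq[String]) {
  // A single thread plays the role of the driver endpoint's message loop.
  private val endpointPool = Executors.newSingleThreadExecutor()
  private implicit val endpoint: ExecutionContext =
    ExecutionContext.fromExecutor(endpointPool)

  private val liveExecutors = mutable.Set.empty[String] ++= ids

  // The handler runs on the endpoint thread and needs the backend's monitor,
  // like removeExecutor in the snippet above.
  private def askRemove(id: String): Future[Boolean] = Future {
    this.synchronized { liveExecutors.remove(id) }
  }

  // Problematic shape: waits for the reply while still holding the monitor,
  // so the handler can never enter its synchronized block in time.
  def resetHoldingLock(timeout: FiniteDuration): Boolean = synchronized {
    try {
      liveExecutors.toList.foreach(id => Await.result(askRemove(id), timeout))
      true
    } catch {
      case _: TimeoutException => false
    }
  }

  // Alternative shape: take a snapshot under the lock, then ask without it.
  def resetOutsideLock(timeout: FiniteDuration): Boolean = {
    val snapshot = synchronized { liveExecutors.toList }
    try {
      snapshot.foreach(id => Await.result(askRemove(id), timeout))
      true
    } catch {
      case _: TimeoutException => false
    }
  }

  def remaining: Int = synchronized { liveExecutors.size }

  def shutdown(): Unit = endpointPool.shutdown()
}
```

In this sketch, calling resetHoldingLock(200.millis) on a backend with at least one executor returns false, because the caller holds the monitor for the whole wait and the handler cannot acquire it. Calling resetOutsideLock afterwards completes, since the monitor is free while the endpoint thread runs the handlers.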

Issue Links

links to

[Github] Pull Request #15481 (scwf)

People

Assignee: Fei Wang

Reporter: Weizhong

Votes: 1

Watchers: 5

Dates

Created: 14/Oct/16 02:57

Updated: 21/Oct/16 21:45

Resolved: 21/Oct/16 21:45