Spark / SPARK-48847

Resubmitting a stage without verifying the stage attempt number may result in an infinite loop


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.2, 3.3.2, 3.4.2, 3.5.1
    • Fix Version/s: None
    • Component/s: Scheduler
    • Labels: None

    Description

      In org.apache.spark.scheduler.DAGScheduler#processShuffleMapStageCompletion:

      private def processShuffleMapStageCompletion(shuffleStage: ShuffleMapStage): Unit = {
          // some code ... 
          
          if (!shuffleStage.isAvailable) {
            // Some tasks had failed; let's resubmit this shuffleStage.
            // TODO: Lower-level scheduler should also deal with this
            logInfo(log"Resubmitting ${MDC(STAGE, shuffleStage)} " +
              log"(${MDC(STAGE_NAME, shuffleStage.name)}) " +
              log"because some of its tasks had failed: " +
              log"${MDC(PARTITION_IDS, shuffleStage.findMissingPartitions().mkString(", "))}")
            submitStage(shuffleStage) // resubmit without check
          } else {
            markMapStageJobsAsFinished(shuffleStage)
            submitWaitingChildStages(shuffleStage)
          } 
      }

      The code above shows that the DAGScheduler resubmits the stage directly, without checking whether the stage's number of consecutive failed attempts has already reached maxConsecutiveStageAttempts. If the stage keeps failing for the same underlying reason, it is resubmitted again and again, which can turn into an infinite resubmission loop instead of the job being aborted.
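
      As an illustration only, below is a minimal, self-contained Scala sketch (not DAGScheduler code; ToyStage, runStage, and the toy submitStage are hypothetical names) of the kind of guard that appears to be missing: before resubmitting, compare the number of consecutive failed attempts against maxConsecutiveStageAttempts (controlled by spark.stage.maxConsecutiveAttempts, default 4) and abort instead of resubmitting forever.

      // Hypothetical toy model, not Spark internals.
      object ResubmitLoopSketch {
        final case class ToyStage(id: Int, var failedAttempts: Int = 0)

        // Analogous to spark.stage.maxConsecutiveAttempts (default 4).
        val maxConsecutiveStageAttempts = 4

        // Simulates a stage whose tasks always fail, e.g. because of a
        // persistent environment problem; every attempt reports failure.
        def runStage(stage: ToyStage): Boolean = false

        def submitStage(stage: ToyStage): Unit = {
          if (runStage(stage)) {
            println(s"Stage ${stage.id} succeeded")
          } else {
            stage.failedAttempts += 1
            if (stage.failedAttempts >= maxConsecutiveStageAttempts) {
              // This is the check the reporter suggests is missing in
              // processShuffleMapStageCompletion: abort instead of looping.
              println(s"Aborting stage ${stage.id} after ${stage.failedAttempts} failed attempts")
            } else {
              println(s"Resubmitting stage ${stage.id} (attempt ${stage.failedAttempts + 1})")
              submitStage(stage)
            }
          }
        }

        def main(args: Array[String]): Unit = submitStage(ToyStage(id = 1))
      }

      Without the failedAttempts check, the else branch would resubmit unconditionally and, for a stage that keeps failing, the loop would never terminate.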



          People

            Assignee: Unassigned
            Reporter: jiang13021
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated: