Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Versions: 3.2.2, 3.4.2, 3.3.2, 3.5.1
Description
In org.apache.spark.scheduler.DAGScheduler#processShuffleMapStageCompletion
private def processShuffleMapStageCompletion(shuffleStage: ShuffleMapStage): Unit = {
  // some code ...
  if (!shuffleStage.isAvailable) {
    // Some tasks had failed; let's resubmit this shuffleStage.
    // TODO: Lower-level scheduler should also deal with this
    logInfo(log"Resubmitting ${MDC(STAGE, shuffleStage)} " +
      log"(${MDC(STAGE_NAME, shuffleStage.name)}) " +
      log"because some of its tasks had failed: " +
      log"${MDC(PARTITION_IDS, shuffleStage.findMissingPartitions().mkString(", "))}")
    submitStage(shuffleStage) // resubmit without check
  } else {
    markMapStageJobsAsFinished(shuffleStage)
    submitWaitingChildStages(shuffleStage)
  }
}
The code above shows that the DAGScheduler resubmits the stage directly, without checking whether the stage's attempt count has already reached maxConsecutiveStageAttempts. The resubmitted stage may simply fail again, and if it keeps failing for the same reason, this path can resubmit it indefinitely, producing an infinite resubmission loop.
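To illustrate the kind of guard the report is asking for, here is a minimal standalone Scala sketch (not actual Spark code): resubmission is bounded by a retry limit, after which the stage would be aborted instead of looping. The `Stage` class, `Scheduler` class, and `resubmitIfAllowed` method are hypothetical names introduced for this sketch; `maxConsecutiveStageAttempts` mirrors the Spark config `spark.stage.maxConsecutiveAttempts`.

```scala
// Hypothetical stand-in for a stage with a running attempt counter.
case class Stage(id: Int, name: String, var attemptId: Int = 0)

// Sketch of a scheduler that bounds consecutive resubmissions.
class Scheduler(maxConsecutiveStageAttempts: Int = 4) {
  /** Returns true if the stage was resubmitted, false if it should be aborted. */
  def resubmitIfAllowed(stage: Stage): Boolean = {
    if (stage.attemptId + 1 >= maxConsecutiveStageAttempts) {
      // A real scheduler would call something like abortStage(...) here
      // instead of resubmitting forever.
      false
    } else {
      stage.attemptId += 1
      true // A real scheduler would call submitStage(stage) here.
    }
  }
}
```

With a limit of 2, the first resubmission succeeds and the second is refused, which is the behavior the unguarded `submitStage(shuffleStage)` call above lacks.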