  1. Aurora
  2. AURORA-1945

Rescinds received but not processed in time before offer accept

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: Scheduler
    • Labels: None

    Description

      The following race condition for offers is currently possible:

      1. Scheduler receives an offer and adds it to the executor queue for processing.
      2. The executor processes the offer and adds it to the HostOffers list.
      3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, when the executor is under heavy load, there can be a delay between receiving the rescind and processing it.
      4. Scheduler accepts the offer before the executor has processed the rescind. This launches a task with an invalid offer, and the task ends up in TASK_LOST.
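
      The four steps above can be replayed deterministically. The following is a minimal sketch, not Aurora code: the single FIFO queue, the `simulate` and `drain` functions, and the `backlog` parameter are hypothetical stand-ins for the executor queue, the HostOffers list, and executor load.

```python
from collections import deque


def drain(queue, host_offers, count):
    """Let the stand-in executor process up to `count` queued tasks in FIFO order."""
    for _ in range(count):
        if not queue:
            return
        kind, offer_id = queue.popleft()
        if kind == "add":
            host_offers.add(offer_id)
        elif kind == "rescind":
            host_offers.discard(offer_id)
        # "noop" tasks model unrelated work that keeps the executor busy


def simulate(backlog):
    """Replay steps 1-4; `backlog` is the number of tasks queued ahead of the rescind."""
    queue = deque()            # stand-in for the executor queue
    host_offers = set()        # stand-in for the HostOffers list
    invalid_at_master = set()  # offers the master no longer considers valid

    # Step 1: the offer arrives and is queued.
    queue.append(("add", "OFFER_X"))
    # Step 2: the executor processes the offer.
    drain(queue, host_offers, 1)
    # Step 3: the master rescinds; the rescind waits behind the backlog.
    invalid_at_master.add("OFFER_X")
    for _ in range(backlog):
        queue.append(("noop", None))
    queue.append(("rescind", "OFFER_X"))
    # The executor gets through only one task before the scheduler accepts.
    drain(queue, host_offers, 1)
    # Step 4: the scheduler accepts whatever is still in the offer list.
    if "OFFER_X" not in host_offers:
        return "NO_OFFER"
    return "TASK_LOST" if "OFFER_X" in invalid_at_master else "TASK_RUNNING"


result_idle = simulate(backlog=0)  # rescind is processed before the accept
result_busy = simulate(backlog=5)  # rescind is still queued at accept time
```

      With no backlog the rescind removes the offer before it can be accepted; with a backlog the stale offer is still in the list when the accept happens, which is the TASK_LOST case described in step 4.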

      The following logs show this in action:

      Mesos:

      I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
      W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
      W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
      I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
      

      Aurora:

      I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
      I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
      I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: OFFER_X is no longer valid
      I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
      W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X. 
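
      To make the size of the delay concrete, the timestamps in the two logs can be subtracted. This is a rough sketch under two assumptions: that the Mesos and Aurora clocks are closely synchronized (the accept shows up at 14:34:23.635 in Aurora and 14:34:23.640 in Mesos, which suggests they are), and that the Mesos "Removing offer" line marks the moment the rescind was issued. The helper name `seconds_between` is made up for illustration.

```python
from datetime import datetime


def seconds_between(start, end):
    """Difference in seconds between two HH:MM:SS.fraction timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Timestamps copied from the log excerpts above.
removed_by_master = "14:33:45.744372"  # Mesos: Removing offer OFFER_X
accepted = "14:34:23.635"              # Aurora: Accepting offer OFFER_X
rescind_processed = "14:34:32.972"     # Aurora: Offer rescinded: OFFER_X

# Time the stale offer remained acceptable on the Aurora side before it was used.
accept_after_removal = seconds_between(removed_by_master, accepted)
# Time between Mesos removing the offer and Aurora processing the rescind.
rescind_lag = seconds_between(removed_by_master, rescind_processed)
```

      Under those assumptions, the accept happened roughly 38 seconds after Mesos removed the offer, and the rescind was processed roughly 47 seconds after the removal.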
      

      We should find a way to process rescinds immediately, or at least prioritize them over other queued work, to avoid this delay. Any fix should also take into account the earlier race condition fixed by AURORA-1933 so that we do not reintroduce it.
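
      One possible shape for such a fix is to handle rescinds directly on the callback thread and remember rescinds that arrive for offers not yet in the list. The sketch below is an illustration only, not the change that resolved this ticket: `OfferTracker` and its methods are hypothetical names, and the "rescind handled before the queued offer is added" case it guards against is an assumed ordering hazard; this ticket does not say whether that is the exact race from AURORA-1933.

```python
import threading


class OfferTracker:
    """Hypothetical offer bookkeeping that applies rescinds without queueing them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._offers = set()     # offers currently eligible for accept
        self._rescinded = set()  # rescinds seen before the matching offer was added

    def on_rescind(self, offer_id):
        """Called directly from the callback thread, bypassing the executor queue."""
        with self._lock:
            if offer_id in self._offers:
                self._offers.discard(offer_id)
            else:
                # The offer add may still be queued; remember the rescind so the
                # later add is dropped instead of resurrecting a dead offer.
                self._rescinded.add(offer_id)

    def add_offer(self, offer_id):
        """Called by the executor when the queued offer is finally processed."""
        with self._lock:
            if offer_id in self._rescinded:
                self._rescinded.discard(offer_id)
                return False
            self._offers.add(offer_id)
            return True

    def try_accept(self, offer_id):
        """Claim the offer for a launch; returns False if it is no longer available."""
        with self._lock:
            if offer_id in self._offers:
                self._offers.discard(offer_id)
                return True
            return False


tracker = OfferTracker()

# Ordering from this ticket: offer added, rescind arrives, then an accept attempt.
tracker.add_offer("OFFER_X")
tracker.on_rescind("OFFER_X")
accepted_after_rescind = tracker.try_accept("OFFER_X")

# Opposite ordering: rescind handled before the queued offer add runs.
tracker.on_rescind("OFFER_Z")
added_after_rescind = tracker.add_offer("OFFER_Z")
```

      In both orderings the rescinded offer never becomes acceptable. A real implementation would also need to bound or expire the remembered rescinds, since a rescind for an offer that was already accepted would otherwise stay in that set indefinitely.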

            People

              Assignee: Jordan Ly (jordanly)
              Reporter: Jordan Ly (jordanly)