  1. Hadoop Map/Reduce
  2. MAPREDUCE-4955

NM container diagnostics for excess resource usage can be lost if task fails while being killed

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.3-alpha, 0.23.5
    • Fix Version/s: None
    • Component/s: mr-am
    • Labels: None

    Description

      When a nodemanager kills a container for exceeding its resource budget, it supplies a diagnostics message in the container status explaining why the container was killed. However, this message can be lost if the task fails while shutting down after the SIGTERM (e.g., lost DFS leases because the filesystem was closed) and notifies the AM via the task umbilical before the AM receives the NM's container status message via the RM heartbeat.

      In that case, the task attempt fails with the task's own failure diagnostic, and the user is left wondering why the task really failed: the NM's diagnostics arrive too late, are not written to the history file, and are lost. If the AM instead receives the container status via the RM heartbeat before the task fails during shutdown, the diagnostics are written to the history file properly and the user can see why the task failed.
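
      The ordering dependence described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class DiagnosticsRace, the Attempt type, the reportFailure method, and both diagnostic strings are hypothetical and are not taken from the MapReduce AM code. It assumes that the first failure report finalizes the attempt and that anything arriving afterwards is dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the race; none of these names come from Hadoop itself.
public class DiagnosticsRace {
    // Illustrative messages only; the real diagnostics text differs.
    static final String NM_DIAG = "NM: container killed for exceeding resource limits";
    static final String TASK_DIAG = "Task: filesystem closed";

    // Simplified task attempt: the first failure report finalizes it.
    static final class Attempt {
        private final List<String> history = new ArrayList<>();
        private boolean finished = false;

        // Records the diagnostic unless the attempt has already been finalized.
        void reportFailure(String diagnostic) {
            if (finished) {
                return; // arrived too late: never reaches the history
            }
            history.add(diagnostic);
            finished = true;
        }

        List<String> history() {
            return history;
        }
    }

    // Replays the two reports in the given order and returns what was recorded.
    static List<String> run(boolean nmStatusFirst) {
        Attempt attempt = new Attempt();
        if (nmStatusFirst) {
            attempt.reportFailure(NM_DIAG);   // container status via RM heartbeat
            attempt.reportFailure(TASK_DIAG); // task umbilical report
        } else {
            attempt.reportFailure(TASK_DIAG); // task umbilical report wins the race
            attempt.reportFailure(NM_DIAG);   // NM diagnostics are dropped
        }
        return attempt.history();
    }

    public static void main(String[] args) {
        System.out.println("NM status first:   " + run(true));
        System.out.println("Task report first: " + run(false));
    }
}
```

      In this model, the NM message reaches the recorded history only when the container status is processed first. When the task's own report is processed first, the NM message is missing from the history, which matches the symptom reported here.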

            People

              Assignee: Unassigned
              Reporter: Jason Darrell Lowe (jlowe)
              Votes: 0
              Watchers: 9