Details

    • Sub-task
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • None
    • None
    • nodemanager

    Description

      The disk health checker verifies a disk by periodically executing mkdir and rmdir.
      If these operations do not return within a moderate timeout, the disk should be marked bad, and nodeInfo.nodeHealthy should flip to false.
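
A minimal sketch of the behaviour described above, assuming Java 8+ and a plain `ExecutorService`. `TimedDiskProbe`, `runWithTimeout`, and `isDirHealthy` are hypothetical names for illustration, not existing YARN classes or methods. The idea is to run the mkdir/rmdir pair on a separate thread and treat a missed deadline as a failed check:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of YARN): probes a directory with mkdir/rmdir under a timeout. */
class TimedDiskProbe {
    // Daemon threads, so a probe stuck in a hung filesystem call cannot block JVM exit.
    // A cached pool lets later probes run even while an earlier one is still stuck.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "disk-probe");
        t.setDaemon(true);
        return t;
    });

    /** Returns true only if the probe completes successfully within timeoutMs. */
    boolean runWithTimeout(Callable<Boolean> probe, long timeoutMs) {
        Future<Boolean> result = executor.submit(probe);
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: a thread blocked inside the kernel may ignore the interrupt.
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The probe itself failed, e.g. with an IOException from mkdir or rmdir.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Creates and deletes a temporary directory under dir, giving up after timeoutMs. */
    boolean isDirHealthy(Path dir, long timeoutMs) {
        return runWithTimeout(() -> {
            Path probeDir = Files.createTempDirectory(dir, "probe"); // mkdir
            Files.delete(probeDir);                                   // rmdir
            return true;
        }, timeoutMs);
    }
}
```

Note that a probe thread that never returns stays alive after the timeout, so a real implementation would probably need to bound the number of outstanding probes per disk.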

      Using Earthquake, our fault injector for distributed systems, I confirmed that current YARN does not have even an implicit timeout (JDK7, Linux 4.2, ext4).
      (I'll post the reproduction script shortly.)

      I think we can fix this issue by making NodeHealthCheckerService.isHealthy() return false if the value of this.getLastHealthReportTime() is too old.
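
The proposed fix could look roughly like the sketch below. `StaleAwareHealthChecker` is a hypothetical stand-in for the real health checker class, and `maxReportAgeMs` is an assumed threshold rather than an existing YARN configuration property. Only `isHealthy()` and `getLastHealthReportTime()` are names taken from the description; the current time is passed in explicitly here so the logic is easy to test:

```java
/** Hypothetical stand-in for the node health checker; only the staleness logic is shown. */
class StaleAwareHealthChecker {
    private final long maxReportAgeMs;   // assumed threshold, not an existing YARN property
    private boolean lastCheckHealthy = true;
    private long lastHealthReportTime;

    StaleAwareHealthChecker(long maxReportAgeMs, long nowMs) {
        this.maxReportAgeMs = maxReportAgeMs;
        this.lastHealthReportTime = nowMs;
    }

    /** Called by the periodic disk check each time a check completes. */
    synchronized void recordReport(boolean healthy, long nowMs) {
        this.lastCheckHealthy = healthy;
        this.lastHealthReportTime = nowMs;
    }

    synchronized long getLastHealthReportTime() {
        return lastHealthReportTime;
    }

    /**
     * Healthy only if the last completed check passed AND it is recent enough.
     * A check that hangs forever never calls recordReport, so the report goes stale.
     */
    synchronized boolean isHealthy(long nowMs) {
        boolean stale = nowMs - getLastHealthReportTime() > maxReportAgeMs;
        return lastCheckHealthy && !stale;
    }
}
```

With this approach the hung check itself is not interrupted; the node is simply reported unhealthy once the last report is older than the threshold, so the threshold would need to be comfortably larger than the check interval.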

      Attachments

        1. YARN-4301-3-fail.patch
          13 kB
          Akihiro Suda
        2. YARN-4301-2.patch
          10 kB
          Akihiro Suda
        3. YARN-4301-1.patch
          7 kB
          Akihiro Suda
        4. concept-async-diskchecker.txt
          3 kB
          Akihiro Suda

            People

              suda Akihiro Suda
              Votes: 0
              Watchers: 10