  1. Hadoop YARN
  2. YARN-7163

RMContext need not be injected into webapp and other always-running services.


Details

    Description

      It is observed that the RM crashes with a heap-space OOM in a secure cluster (HTTP authentication is Kerberos) when RM HA is enabled.
      The scenario is:
      1. Start the RM in HA secure mode. Let's say RM1 is in active mode.
      2. Run enough applications that they use more than 50% of the configured heap space. For example, if the heap space is 2 GB, run applications that occupy 1.5 GB of it.
      3. Switch the RM to standby and then bring it back to active. While recovering applications from the state store, the RM crashes with OOM.
      Note: This issue happens only when the RM is started directly as active (not switched from standby to active during JVM startup).

      The heap dump shows that RMAuthenticationFilter holds 60% of the heap space, and the other 40% is held by RMAppState while recovering from the state store. Together these exceed the heap space, and the RM crashes with OOM.
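
      The retention pattern suggested by the issue title can be sketched as follows. This is only an illustration under assumptions, not the actual YARN code or the attached patch: Context, LeakyFilter, LookupFilter and ResourceManagerLike are hypothetical stand-ins, and the report does not itself confirm that this is the exact mechanism. The sketch shows how a long-lived service that captures a context object at construction keeps the old application state reachable after the owner replaces its context, whereas a service that looks the context up on demand does not.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleContextSketch {
    /** Hypothetical stand-in for RMContext: holds per-activation application state. */
    static final class Context {
        final List<byte[]> appState = new ArrayList<>();
    }

    /** Hypothetical owner that replaces its context on a standby -> active cycle. */
    static final class ResourceManagerLike {
        private volatile Context current = new Context();

        Context getContext() {
            return current;
        }

        void transitionToStandbyAndBack() {
            current = new Context();
        }
    }

    /** Leaky pattern: an always-running service captures the context once. */
    static final class LeakyFilter {
        private final Context captured;

        LeakyFilter(Context context) {
            this.captured = context;
        }

        Context context() {
            return captured;
        }
    }

    /** Alternative: keep only the owner and fetch the current context on demand. */
    static final class LookupFilter {
        private final ResourceManagerLike rm;

        LookupFilter(ResourceManagerLike rm) {
            this.rm = rm;
        }

        Context context() {
            return rm.getContext();
        }
    }

    public static void main(String[] args) {
        ResourceManagerLike rm = new ResourceManagerLike();
        LeakyFilter leaky = new LeakyFilter(rm.getContext());
        LookupFilter lookup = new LookupFilter(rm);

        // Simulate application state accumulated while active.
        rm.getContext().appState.add(new byte[1024]);
        rm.transitionToStandbyAndBack();

        // The leaky filter still references the old context and its state.
        System.out.println("leaky sees current context:  " + (leaky.context() == rm.getContext()));
        System.out.println("lookup sees current context: " + (lookup.context() == rm.getContext()));
        System.out.println("old entries still reachable: " + leaky.context().appState.size());
    }
}
```

      In this sketch the old context stays reachable through LeakyFilter for as long as the filter lives, so its state cannot be garbage collected while the new context is being populated during recovery. That would be consistent with the old and the recovered state being on the heap at the same time.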

      Attachments

        1. suspect-2.png
          98 kB
          Rohith Sharma K S
        2. suspect-1.png
          155 kB
          Rohith Sharma K S
        3. YARN-7163.01.patch
          6 kB
          Rohith Sharma K S
        4. YARN-7163.02.patch
          7 kB
          Rohith Sharma K S
        5. YARN-7163.03.patch
          50 kB
          Rohith Sharma K S
        6. YARN-7163.03.patch
          53 kB
          Rohith Sharma K S
        7. YARN-7163-branch-2.01.patch
          58 kB
          Rohith Sharma K S
        8. YARN-7163-branch-2.addednum.patch
          1 kB
          Rohith Sharma K S

            People

              rohithsharma Rohith Sharma K S
              rohithsharma Rohith Sharma K S
