[YARN-5078] [Umbrella] NodeManager health checker improvements - ASF JIRA

There have been a bunch of NodeManager health checker improvement requests in the past.

Initially, I expect that some base functionality just needs to be added. The most obvious parts are:

Finding appropriate measurements of health
Storing measurements as metrics. This should allow easy comparison of good and bad nodes, and should eventually lead to threshold-based blacklisting/whitelisting.
Adding metrics to the NodeManager UI
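
As a rough illustration of the threshold idea in the list above, here is a minimal sketch that flags nodes whose stored measurement exceeds a limit. It does not use any existing YARN API; the class name `NodeHealthThresholds`, the method `unhealthyNodes`, the "disk error rate" metric, and the 0.10 threshold are all hypothetical and chosen only for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of the NodeManager code base.
public class NodeHealthThresholds {

    // Returns the nodes whose measured value is strictly above the threshold,
    // in the iteration order of the input map.
    static List<String> unhealthyNodes(Map<String, Double> metricByNode, double threshold) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Double> entry : metricByNode.entrySet()) {
            if (entry.getValue() > threshold) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Assumed metric: fraction of failed disk operations per node.
        Map<String, Double> diskErrorRate = new LinkedHashMap<>();
        diskErrorRate.put("node-a", 0.01);
        diskErrorRate.put("node-b", 0.35);
        diskErrorRate.put("node-c", 0.02);

        // Nodes above the assumed 0.10 threshold would be blacklist candidates.
        System.out.println(unhealthyNodes(diskErrorRate, 0.10));
    }
}
```

Keeping the measurement separate from the threshold decision means the same stored metrics could later feed both the UI and any blacklisting/whitelisting policy.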

After this basic functionality is added, we can start considering enhanced forms of NodeManager health status conditions.

1.	Make the NodeManager's health checker service pluggable	Open	Raghav Mohan
2.	NodeManager not blacklisting the disk (shuffle) with errors	Open	Unassigned
3.	Make "good" local directories available to ContainerExecutors at initialization time	Open	Sidharta Seethana
4.	"Health-Report" column of NodePage should display more information.	Open	Unassigned
5.	NM disk health checker should have a timeout	Patch Available	Akihiro Suda
6.	Use smartctl to determine health of disks	Open	Unassigned