Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
There have been a bunch of NodeManager health checker improvement requests in the past.
Right now, I expect that some base functionality needs to be added first. The most obvious parts are:
- Finding appropriate measurements of health
- Storing measurements as metrics. This should allow easy comparison of good nodes and bad nodes, and should eventually lead to threshold-based blacklisting/whitelisting.
- Adding metrics to the NodeManager UI
After this basic functionality is added, we can start considering enhanced forms of NodeManager health status conditions.
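To make the threshold idea concrete, here is a rough sketch of how stored measurements could be compared against per-metric limits. It is not existing NodeManager code: the class `NodeHealthSketch`, its methods, and the metric name `diskUtilizationPercent` are all hypothetical and exist only for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: classify a node from stored measurements using
// simple upper-bound thresholds. None of these names are NodeManager APIs.
public class NodeHealthSketch {

    public enum Status { HEALTHY, UNHEALTHY }

    // Metric name -> maximum value still considered healthy.
    private final Map<String, Double> maxAllowed = new LinkedHashMap<>();

    // Register an upper bound for one metric; returns this for chaining.
    public NodeHealthSketch threshold(String metric, double max) {
        maxAllowed.put(metric, max);
        return this;
    }

    // A node is UNHEALTHY if any reported metric exceeds its threshold.
    // Metrics that were not reported are ignored.
    public Status evaluate(Map<String, Double> measurements) {
        for (Map.Entry<String, Double> limit : maxAllowed.entrySet()) {
            Double value = measurements.get(limit.getKey());
            if (value != null && value > limit.getValue()) {
                return Status.UNHEALTHY;
            }
        }
        return Status.HEALTHY;
    }

    public static void main(String[] args) {
        // The metric name and the 90.0 limit are assumed values.
        NodeHealthSketch checker =
            new NodeHealthSketch().threshold("diskUtilizationPercent", 90.0);
        Map<String, Double> sample =
            Collections.singletonMap("diskUtilizationPercent", 95.0);
        System.out.println(checker.evaluate(sample));
    }
}
```

Keeping the thresholds in a map, separate from the measurements, means the same stored metrics could be reused for comparing nodes in the UI and for any later blacklisting/whitelisting decision.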
Sub-Tasks
1. Make the NodeManager's health checker service pluggable | Open | Raghav Mohan
2. NodeManager not blacklisting the disk (shuffle) with errors | Open | Unassigned
3. Make "good" local directories available to ContainerExecutors at initialization time | Open | Sidharta Seethana
4. Need a default NodeManager health check script | Resolved | Yufei Gu
5. "Health-Report" column of NodePage should display more information. | Open | Unassigned
6. NM disk health checker should have a timeout | Patch Available | Akihiro Suda
7. Make DiskChecker pluggable in NodeManager | Resolved | Yufei Gu
8. Use smartctl to determine health of disks | Open | Unassigned
9. Create new DiskValidator class with metrics | Resolved | Yufei Gu
10. Define exit code for allowing NodeManager health script to mar | Resolved | Yufei Gu
11. Better handling when bad script is configured as Node's HealthScript | Resolved | Unassigned
12. Add default value for NM disk validator | Resolved | Yufei Gu