Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
There have been a bunch of NodeManager health checker improvement requests in the past.
Initially, I expect we just need to add some base functionality. The most obvious parts are:
- Finding appropriate measurements of health
- Storing measurements as metrics. This should allow easy comparison of good nodes and bad nodes. This should eventually lead to threshold blacklisting/whitelisting.
- Adding metrics to the NodeManager UI
After this basic functionality is added, we can start considering enhanced forms of NodeManager health status conditions.
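As a rough illustration of the "store measurements as metrics, then compare good and bad nodes" idea, here is a minimal standalone sketch. It is not NodeManager code and uses no Hadoop API. The metric name `disk_error_rate`, the `factor` and `floor` parameters, and the sample values are all assumptions made up for the example. It flags a node when its metric is far above the cluster median, which is one possible way to do threshold blacklisting.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class NodeHealth:
    node: str
    # Hypothetical metric: fraction of disk operations that failed.
    disk_error_rate: float


def flag_unhealthy(samples, factor=3.0, floor=0.01):
    """Return the names of nodes whose metric exceeds the threshold.

    The threshold is the larger of (cluster median * factor) and a fixed
    floor, so a uniformly healthy cluster does not flag tiny differences.
    Both parameters are illustrative assumptions, not tuned values.
    """
    if not samples:
        return []
    baseline = median(s.disk_error_rate for s in samples)
    threshold = max(baseline * factor, floor)
    return [s.node for s in samples if s.disk_error_rate > threshold]


# Made-up sample measurements for four nodes.
samples = [
    NodeHealth("nm-1", 0.001),
    NodeHealth("nm-2", 0.002),
    NodeHealth("nm-3", 0.150),
    NodeHealth("nm-4", 0.001),
]
blacklist = flag_unhealthy(samples)
print(blacklist)
```

Comparing against the cluster median rather than a fixed constant is only one design choice. A real implementation would probably need per-metric thresholds and some smoothing over time.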
Attachments
Sub-Tasks
1. Make the NodeManager's health checker service pluggable | Open | Raghav Mohan
2. NodeManager not blacklisting the disk (shuffle) with errors | Open | Unassigned
3. Make "good" local directories available to ContainerExecutors at initialization time | Open | Sidharta Seethana
4. "Health-Report" column of NodePage should display more information. | Open | Unassigned
5. NM disk health checker should have a timeout | Patch Available | Akihiro Suda
6. Use smartctl to determine health of disks | Open | Unassigned