Details
Improvement
Status: Open
Major
Resolution: Unresolved
2.6.0
None
None
Description
Improve Ambari Monitoring to be extensible by accepting standard-format Nagios Plugins, the de facto standard for extensible checks, which work across a large number of monitoring systems. This would let users extend Ambari checks and contribute them back into the core to improve monitoring.
I know Ambari used to use Nagios core and replaced it with custom monitoring management. I'm not suggesting using Nagios core itself, only the Nagios Plugins format, for community re-use and extensibility.
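To make the plugin format concrete, here is a minimal sketch of how a framework could run any Nagios-format plugin and interpret the result. It relies only on the plugin convention itself: exit code 0 is OK, 1 is WARNING, 2 is CRITICAL, 3 is UNKNOWN, and the first line of stdout is the status message, optionally followed by `|` and performance data. The names `run_plugin` and `CheckResult` are hypothetical and are not part of Ambari.

```python
import subprocess
from typing import List, NamedTuple

# Exit codes defined by the Nagios plugin convention.
STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


class CheckResult(NamedTuple):
    """Hypothetical container for one plugin run (not an Ambari type)."""
    status: str
    message: str
    perfdata: str


def run_plugin(argv: List[str], timeout: float = 30.0) -> CheckResult:
    """Run a Nagios-format plugin and translate its exit code and output."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return CheckResult("UNKNOWN", "plugin timed out after %ss" % timeout, "")
    except OSError as exc:
        # Covers a missing or non-executable plugin file.
        return CheckResult("UNKNOWN", "could not run plugin: %s" % exc, "")

    # Any exit code outside 0-3 is treated as UNKNOWN.
    status = STATUS_BY_EXIT_CODE.get(proc.returncode, "UNKNOWN")

    # Only the first stdout line is parsed; text after "|" is performance data.
    lines = proc.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    message, _, perfdata = first_line.partition("|")
    return CheckResult(status, message.strip(), perfdata.strip())
```

Because the contract is just an exit code plus a line of text, the plugin itself can be written in any language, which is what makes existing third-party checks reusable without changes.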
Tie this into Rolling Restarts, so that users can add extra monitoring checks at any layer.
See AMBARI-24380, where Ambari didn't check that RegionServers had restarted successfully before continuing to take more down. This would be quicker and easier to fix if the framework were engineered to be more extensible. Using checks in a standard format is key: users could quickly and easily add checks to general health monitoring or to rolling restarts, stopping Ambari from taking down successive nodes without checking the health of the prior ones.
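The gating behaviour requested here can be sketched as a loop that refuses to touch the next node until the previous one passes its health check. This is an illustration under stated assumptions, not Ambari's implementation: `rolling_restart`, `restart`, and `is_healthy` are hypothetical names, and `is_healthy` stands in for any user-supplied check, such as one that runs a Nagios-format plugin and returns True only on an OK result.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    attempts: int = 5,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart hosts one at a time, aborting if a restarted host stays unhealthy.

    Returns the hosts that were restarted and verified healthy.
    Raises RuntimeError on the first host that fails its health check,
    leaving all remaining hosts untouched.
    """
    verified: List[str] = []
    for host in hosts:
        restart(host)
        for attempt in range(attempts):
            if is_healthy(host):
                break
            # Wait between attempts, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay_seconds)
        else:
            # The loop ended without a healthy result: stop the whole rollout.
            raise RuntimeError(
                "%s did not become healthy after %d attempts; "
                "aborting before restarting further hosts" % (host, attempts)
            )
        verified.append(host)
    return verified
```

The important design choice is that failure is fatal to the rollout: an unhealthy node raises instead of being logged and skipped, so the remaining nodes keep serving traffic.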
There are also lots of third-party plugins that vendors or users could use to extend Ambari health checks, such as:
https://github.com/harisekhon/nagios-plugins
Issue Links
- is related to AMBARI-24719 Kafka Rolling Restart causes outage(s) due to not checking for under replicated partitions (Open)
- relates to AMBARI-24380 Ambari HBase Rolling Restart failed to check RegionServers restarted successfully, continued to take down rest of RegionServers! (Open)