Details
- Improvement
- Status: Open
- Critical
- Resolution: Unresolved
- 2.6.2
- None
- None
Description
Ambari causes Kafka topic partition outages during rolling restarts because it only does a simplistic 2-minute wait between brokers and doesn't check the state of partition replicas before taking another broker down.
On busy Kafka clusters with lots of topics / partitions / data, it might take a while before in-sync replicas recover.
Ambari should therefore check for any under-replicated partitions and wait as long as it takes for them to recover before proceeding to the next broker. There is, however, a complication in doing so: if there is a topic partition with a replica that no longer exists (e.g. ambari_kafka_service_check), it will never recover, so there needs to be some thoughtful handling around that.
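As a rough illustration of the check described above, here is a minimal Python sketch. It is not Ambari code. It assumes the text comes from `kafka-topics.sh --describe --under-replicated-partitions`, with lines of the form `Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...`, and it interprets "a replica that no longer exists" as a replica assigned to a broker ID that is not in the live broker set. The names `blocking_partitions`, `wait_until_safe`, `fetch_output` and `live_brokers` are hypothetical, as are the timeout and poll interval defaults.

```python
import re
import time

# Assumed line format of `kafka-topics.sh --describe --under-replicated-partitions`.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)"
    r".*?Replicas:\s*(?P<replicas>[\d,]*)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def _ids(csv):
    """Turn '1,2,3' into {1, 2, 3}; an empty string gives an empty set."""
    return {int(x) for x in csv.split(",") if x}


def blocking_partitions(describe_output, live_brokers):
    """Return (topic, partition) pairs that should hold up the restart.

    A partition blocks only if a replica missing from the ISR sits on a live
    broker. Replicas on brokers that no longer exist can never catch up, so
    they are skipped instead of stalling the restart forever.
    """
    live = set(live_brokers)
    blocking = []
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        lagging = _ids(m.group("replicas")) - _ids(m.group("isr"))
        if lagging & live:
            blocking.append((m.group("topic"), int(m.group("partition"))))
    return blocking


def wait_until_safe(fetch_output, live_brokers, timeout_s=None, poll_s=10,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until no recoverable partition is under-replicated.

    fetch_output is a caller-supplied function returning the describe text.
    timeout_s=None waits as long as it takes; otherwise False is returned
    once the deadline passes so the caller can decide what to do.
    """
    deadline = None if timeout_s is None else clock() + timeout_s
    while True:
        if not blocking_partitions(fetch_output(), live_brokers):
            return True
        if deadline is not None and clock() >= deadline:
            return False
        sleep(poll_s)
```

A rolling restart could call `wait_until_safe(...)` after each broker comes back and only stop the next broker once it returns `True`. Whether a timeout should abort the restart or just raise a warning is a policy choice that this sketch leaves to the caller.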
This might be solved by AMBARI-24203, but I'm not sure whether it is tied in properly to the rolling restarts, what its timeout policy or polling interval is, or whether it takes the above paragraph into account.
This could also have been easily mitigated if Ambari had proper extensible checking, as raised in AMBARI-24381.
Issue Links
- relates to
  - AMBARI-24203 Improve Kafka service check to check for under-replicated partitions (Resolved)
  - AMBARI-24381 Ambari Extensible Monitoring - use Nagios Plugins format and make extensible for users to extend checking (Open)