Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Description
NIFI-7920 fixed a bug that could result in nodes getting the wrong Revision for some components. That fix, however, appears to have caused a regression. When a node is disconnected because it failed to service a replicated API request (for example, a request to stop, start, or move a component), it now unregisters from leader election for the Primary Node and Cluster Coordinator roles. If it then reconnects, however, it does not re-register for those roles. As a result, a node that disconnects and reconnects can no longer become Cluster Coordinator. If this happens to every node in a cluster, no node remains eligible to become Cluster Coordinator. This results in logs such as:
2021-02-03 20:14:55,167 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: java.lang.IllegalArgumentException: Cannot send heartbeat to address []. Address must be in <hostname>:<port> format
And errors in the UI stating:
Action cannot be performed because there is currently no Cluster Coordinator elected. The request should be tried again after a moment, after a Cluster Coordinator has been automatically elected.. Returning Service Unavailable response.
At this point, no Cluster Coordinator will be elected until the nodes are restarted.
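The failure mode described above can be illustrated with a small, self-contained Java sketch. This is not NiFi code: ToyElection, ToyNode, and the reRegisterOnReconnect flag are hypothetical names invented for illustration, and the "election" simply picks the lowest registered node id. It only models the behavior reported here, namely that a node unregisters from both roles on disconnect and, in the regressed case, does not register again on reconnect.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

class ReRegistrationSketch {

    // Role names taken from the issue text.
    static final String CLUSTER_COORDINATOR = "Cluster Coordinator";
    static final String PRIMARY_NODE = "Primary Node";

    /** Hypothetical stand-in for a leader-election service: tracks candidates per role. */
    static class ToyElection {
        private final Map<String, Set<String>> candidates = new HashMap<>();

        void register(String role, String nodeId) {
            candidates.computeIfAbsent(role, r -> new TreeSet<>()).add(nodeId);
        }

        void unregister(String role, String nodeId) {
            Set<String> nodes = candidates.get(role);
            if (nodes != null) {
                nodes.remove(nodeId);
            }
        }

        /** Toy rule: the lowest registered node id wins; empty if nobody is registered. */
        Optional<String> leader(String role) {
            Set<String> nodes = candidates.getOrDefault(role, Collections.emptySet());
            return nodes.stream().findFirst();
        }
    }

    /** Hypothetical node that registers for both roles when it is created. */
    static class ToyNode {
        private final String id;
        private final ToyElection election;
        private final boolean reRegisterOnReconnect; // false models the regression

        ToyNode(String id, ToyElection election, boolean reRegisterOnReconnect) {
            this.id = id;
            this.election = election;
            this.reRegisterOnReconnect = reRegisterOnReconnect;
            registerForRoles();
        }

        private void registerForRoles() {
            election.register(CLUSTER_COORDINATOR, id);
            election.register(PRIMARY_NODE, id);
        }

        /** On disconnect the node gives up its candidacy for both roles. */
        void disconnect() {
            election.unregister(CLUSTER_COORDINATOR, id);
            election.unregister(PRIMARY_NODE, id);
        }

        /** On reconnect the node must register again, or it stays ineligible. */
        void reconnect() {
            if (reRegisterOnReconnect) {
                registerForRoles();
            }
        }
    }

    /** Disconnects and reconnects every node, then reports the elected coordinator. */
    static Optional<String> coordinatorAfterBounce(boolean reRegisterOnReconnect) {
        ToyElection election = new ToyElection();
        ToyNode[] nodes = {
            new ToyNode("node-1", election, reRegisterOnReconnect),
            new ToyNode("node-2", election, reRegisterOnReconnect),
            new ToyNode("node-3", election, reRegisterOnReconnect)
        };
        for (ToyNode node : nodes) {
            node.disconnect();
            node.reconnect();
        }
        return election.leader(CLUSTER_COORDINATOR);
    }

    public static void main(String[] args) {
        System.out.println("without re-registration: " + coordinatorAfterBounce(false));
        System.out.println("with re-registration:    " + coordinatorAfterBounce(true));
    }
}
```

With reRegisterOnReconnect set to false, no candidate remains after every node has disconnected and reconnected once, which corresponds to the "no Cluster Coordinator elected" state above. With it set to true, a coordinator is elected again without any restart.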