Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Karaf using Cellar in a clustered environment to replicate configuration updates.
Description
In a Karaf cluster using Cellar, and more specifically cellar-config, an update to a configuration on one node is not replicated to another node.
Investigation points to a race condition in which a node receives the ClusterConfigurationEvent before the ReplicatedMap has actually been replicated to that node. As a result, the node does not store the configuration and its local version stays stale.
The race condition starts here :
and continues on another node here :
Cellar uses a Hazelcast ReplicatedMap to propagate configurations across the cluster, and the replication operation is asynchronous. Thus, if the ClusterConfigurationEvent is received before the replication finishes on the target node, nothing happens: no error is detected and no retry is attempted.
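The ordering problem described above can be reproduced in isolation. The following is only a minimal sketch using plain java.util.concurrent, not Cellar or Hazelcast code: the class name, the two maps and the "my.pid" key are hypothetical stand-ins, and the latch forces the "event first, replication second" ordering that the real cluster only hits occasionally.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in classes and names; this is not Cellar or Hazelcast code.
class ReplicationRaceSketch {
    // Stand-in for the target node's copy of the replicated map (pid -> value).
    static final Map<String, String> targetReplica = new ConcurrentHashMap<>();
    // Stand-in for the target node's local configuration store.
    static final Map<String, String> targetLocal = new ConcurrentHashMap<>();

    // Mimics the handler check: apply the cluster value only if it differs from the local one.
    static boolean handleEvent(String pid) {
        String clusterValue = targetReplica.get(pid);
        if (clusterValue == null || clusterValue.equals(targetLocal.get(pid))) {
            return false; // nothing to apply: the replica still looks identical
        }
        targetLocal.put(pid, clusterValue);
        return true;
    }

    // Returns whether the event handler applied the update.
    static boolean simulate() {
        targetReplica.clear();
        targetLocal.clear();
        targetReplica.put("my.pid", "v1");
        targetLocal.put("my.pid", "v1");

        CountDownLatch eventHandled = new CountDownLatch(1);
        // Asynchronous replication of the new value, held back until the event was handled.
        CompletableFuture<Void> replication = CompletableFuture.runAsync(() -> {
            try {
                eventHandled.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            targetReplica.put("my.pid", "v2");
        });

        boolean applied = handleEvent("my.pid"); // the event arrives first
        eventHandled.countDown();
        replication.join();                      // replication completes afterwards
        return applied;
    }

    public static void main(String[] args) {
        boolean applied = simulate();
        System.out.println("applied=" + applied
                + " replica=" + targetReplica.get("my.pid")
                + " local=" + targetLocal.get("my.pid"));
    }
}
```

In this sketch the handler returns without applying anything, the replica ends up holding the new value and the local store keeps the old one, which matches the stale state observed on the target node.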
To reproduce the problem, we can use breakpoints (thread-level ones):
- The first one simulates a long replicate operation: add a breakpoint on the emitting node in the class com.hazelcast.replicatedmap.impl.operation.ReplicateUpdateOperation.run()
- The second one goes in the Cellar event handler that applies the replicated configuration, org.apache.karaf.cellar.config.ConfigurationEventHandler.handle(), at the line:
if (!equals(clusterDictionary, localDictionary) && canDistributeConfig(localDictionary)) {
Now update a configuration on the first node. On the target node, we can see that the configuration is not updated when the event is received.
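Since the report notes that no retry happens, one possible direction is a bounded re-read of the replica before giving up. This is only a sketch under assumptions, not a proposed patch: the class, the awaitChange helper and its parameters are hypothetical, values are simplified to strings, and it assumes that an incoming event always means the cluster value should differ from the local one.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper; not part of Cellar or Hazelcast.
class RetryOnStaleReplicaSketch {
    // Re-reads the replica up to maxAttempts times, waiting delayMillis between reads,
    // and returns the first value that differs from the local one (empty if none appears).
    static Optional<String> awaitChange(Supplier<String> readReplica, String localValue,
                                        int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String clusterValue = readReplica.get();
            if (clusterValue != null && !clusterValue.equals(localValue)) {
                return Optional.of(clusterValue); // replication has caught up
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty(); // still stale: the caller can log this instead of staying silent
    }
}
```

A handler built this way would apply the returned value when present and report the empty case, so a missed update at least becomes visible.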