Description
- I did the following:
./solr start -e cloud -noprompt
kill -9 <pid-of-node2> //Not the node which is running ZK
- /live_nodes reflects that the node is gone.
- This is the only message which gets logged on the node1 server after killing node2
45812 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN org.apache.zookeeper.server.NIOServerCnxn – caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14ac40f26660001, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
- The graph shows node2 in the 'Gone' state
- clusterstate.json keeps showing the replica as 'active'
{"collection1":{
    "shards":{"shard1":{
        "range":"80000000-7fffffff",
        "state":"active",
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"collection1",
            "node_name":"169.254.113.194:8983_solr",
            "base_url":"http://169.254.113.194:8983/solr",
            "leader":"true"},
          "core_node2":{
            "state":"active",
            "core":"collection1",
            "node_name":"169.254.113.194:8984_solr",
            "base_url":"http://169.254.113.194:8984/solr"}}}},
    "maxShardsPerNode":"1",
    "router":{"name":"compositeId"},
    "replicationFactor":"1",
    "autoAddReplicas":"false",
    "autoCreated":"true"}}
One immediate problem I can see is that autoAddReplicas doesn't work, since clusterstate.json never changes. Other features may be affected by this as well.
On first thought, I think we can handle this as follows: the shard leader could listen for changes on /live_nodes and, if any of its replicas were on the node that went away, mark them as 'down' in clusterstate.json.
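
To make the proposal concrete, here is a rough sketch of the reconciliation step only. It is not Solr code: the function name mark_replicas_down and the plain-dict representation of the state are assumptions for illustration, and the sample data is a trimmed copy of the clusterstate.json above. How the /live_nodes watch gets registered and how the updated state gets written back to ZooKeeper are deliberately left out.

```python
import copy


def mark_replicas_down(cluster_state, live_nodes):
    """Hypothetical helper: return (new_state, changed).

    new_state is a copy of cluster_state in which every replica whose
    node_name is not in live_nodes has its state set to 'down'.
    changed lists the (collection, shard, replica) names that were updated.
    """
    new_state = copy.deepcopy(cluster_state)  # leave the caller's state untouched
    changed = []
    for coll_name, coll in new_state.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                gone = replica["node_name"] not in live_nodes
                if gone and replica.get("state") != "down":
                    replica["state"] = "down"
                    changed.append((coll_name, shard_name, replica_name))
    return new_state, changed


# Trimmed copy of the clusterstate.json from this report.
cluster_state = {
    "collection1": {
        "shards": {
            "shard1": {
                "range": "80000000-7fffffff",
                "state": "active",
                "replicas": {
                    "core_node1": {
                        "state": "active",
                        "core": "collection1",
                        "node_name": "169.254.113.194:8983_solr",
                        "leader": "true",
                    },
                    "core_node2": {
                        "state": "active",
                        "core": "collection1",
                        "node_name": "169.254.113.194:8984_solr",
                    },
                },
            }
        }
    }
}

# After kill -9 on node2, only node1 is left under /live_nodes.
live_nodes = {"169.254.113.194:8983_solr"}

updated, changed = mark_replicas_down(cluster_state, live_nodes)
print(changed)
```

With the state from this report, only core_node2 would be flipped to 'down', and core_node1 (the leader, still live) is left alone. Whether the leader or some other component should own this step is an open question.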