Description
The current implementation of KeepAliveDaemon.java will sometimes force disconnections on well-behaved connections. The problem may arise when a connection goes away and the KeepAlive send to that channel blocks while attempting to reconnect. If this reconnection takes a while, other channels that were responding fine may get their connections broken. This happens due to the following code in KeepAliveDaemon.java:
if ((channel.getLastReceiptTimestamp() + channel.getKeepAliveTimeout() * 2) < System.currentTimeMillis()) {
or
} else if ((channel.getLastReceiptTimestamp() + channel.getKeepAliveTimeout()) < System.currentTimeMillis()) {
Because the receipt timestamp is checked against System.currentTimeMillis() at the moment each channel is examined, any time spent blocked on an earlier channel counts against the channels examined after it. If a KeepAlive send (in examineChannel) for a broken channel takes longer than some good channel's KeepAliveTimeout, then the good connection gets broken.
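The timing problem can be reproduced in isolation with a simulated clock. The sketch below is not the attached patch and does not use the real ActiveMQ classes: Channel, FakeClock, examine, and demoChannels are hypothetical names made up for illustration, and the "snapshot" variant is just one possible way to avoid charging a blocked send against later channels.

```java
import java.util.ArrayList;
import java.util.List;

public class KeepAliveSketch {

    /** Hypothetical stand-in for a transport channel; not the real ActiveMQ class. */
    static final class Channel {
        final String name;
        final long lastReceiptTimestamp;
        final long keepAliveTimeout;
        final long sendBlocksForMillis; // simulated time a keep-alive send blocks

        Channel(String name, long lastReceiptTimestamp, long keepAliveTimeout, long sendBlocksForMillis) {
            this.name = name;
            this.lastReceiptTimestamp = lastReceiptTimestamp;
            this.keepAliveTimeout = keepAliveTimeout;
            this.sendBlocksForMillis = sendBlocksForMillis;
        }
    }

    /** Simulated clock, so the example is deterministic. */
    static final class FakeClock {
        long now;

        FakeClock(long now) {
            this.now = now;
        }
    }

    /**
     * Runs one pass over the channels and returns the names of the channels
     * that would be closed. With useSnapshot == false the check uses the live
     * clock, as in the code quoted above; with useSnapshot == true it uses the
     * time at which the pass started (an assumed mitigation).
     */
    static List<String> examine(List<Channel> channels, FakeClock clock, boolean useSnapshot) {
        long passStart = clock.now;
        List<String> closed = new ArrayList<>();
        for (Channel c : channels) {
            long reference = useSnapshot ? passStart : clock.now;
            if (c.lastReceiptTimestamp + c.keepAliveTimeout * 2 < reference) {
                closed.add(c.name);
            } else if (c.lastReceiptTimestamp + c.keepAliveTimeout < reference) {
                // Simulated keep-alive send; a broken channel blocks here.
                clock.now += c.sendBlocksForMillis;
            }
        }
        return closed;
    }

    /** One broken channel whose send blocks for 5 s, followed by a healthy one. */
    static List<Channel> demoChannels() {
        List<Channel> channels = new ArrayList<>();
        channels.add(new Channel("broken", 8_500, 1_000, 5_000));
        channels.add(new Channel("good", 9_900, 1_000, 0));
        return channels;
    }

    public static void main(String[] args) {
        System.out.println(examine(demoChannels(), new FakeClock(10_000), false)); // prints [good]
        System.out.println(examine(demoChannels(), new FakeClock(10_000), true));  // prints []
    }
}
```

With the live clock, the healthy channel is closed even though it received data 100 ms before the pass started. The snapshot variant leaves it alone, at the cost of detecting a genuinely dead channel up to one pass later.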
This can, in turn, cause some pretty bad behavior in the Broker. While testing and diagnosing this problem, I saw some brokers in a network of brokers get stuck. The sequence of events during recovery, which gets interrupted when the connections are closed, would sometimes lead to the broker hanging while waiting for a receipt, such as during an addConsumer (which eventually calls syncSendWithReceipt).
I have redone the logic in KeepAliveDaemon.java (which required a small change to ReliableTransportChannel as well). This now seems to work.
I'm a bit concerned about the blocking calls, though. This may be a different issue / bug. It looked like there was a mechanism to cancel outstanding receipt waiters, but every once in a while that mechanism would not get called. When that happens, the broker basically gets stuck and never really recovers.