Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 0.11.0.1
- Fix Version/s: None
- Component/s: None
Description
We run our Kafka cluster on a cloud service provider and, as a consequence, see a large tail latency on disk writes that is out of our control: some writes take on the order of seconds. We have noticed that these long write times are often correlated with subsequent ZooKeeper disconnects on the brokers. It appears that during the long write the ZooKeeper heartbeat thread does not get scheduled CPU time, resulting in a long gap between heartbeats. After the write, the ZK thread does get scheduled again, but it detects that it has not heard from ZooKeeper in a while, so it drops its connection and then rejoins the cluster.
Note that the timeout reported in the logs is inconsistent with the timeout set in the configuration (zookeeper.session.timeout.ms, left at its default of 6000 ms). We have seen a range of reported values, including 5950 ms (below the threshold), 12032 ms (double the threshold), and 25999 ms (much larger than the threshold).
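For reference, the setting involved is the broker's zookeeper.session.timeout.ms in server.properties. Raising it would presumably mask the symptom at the cost of slower failure detection, and it would not address the underlying thread starvation; the value below is purely illustrative, not a recommendation:

# server.properties on the broker -- illustrative value only
# Default is 6000 ms; a larger value tolerates longer pauses on the broker
# but delays detection of genuinely failed brokers.
zookeeper.session.timeout.ms=18000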
We noticed that during a degradation of our cloud provider's storage service, these ZooKeeper disconnects increased drastically in frequency.
We are hoping there is a way to decouple these components. Do you agree with our diagnosis that the ZK disconnects are caused by thread contention during long disk writes? Could the ZK heartbeat thread be scheduled at a higher priority? Do you have any suggestions for avoiding the ZK disconnects?
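To help confirm the thread-starvation theory, a minimal scheduling-gap watchdog like the sketch below could be run on the broker host. It is plain Java, and the class name, interval, and threshold are ours (illustrative, not part of Kafka): it sleeps for a fixed interval and logs whenever it wakes up much later than expected, which would point at CPU starvation or OS-level blocking rather than GC.

// SchedulingGapWatchdog.java -- illustrative diagnostic sketch, not Kafka code.
// Logs whenever this thread wakes up well past its expected interval,
// suggesting that other threads (e.g. the ZK heartbeat sender) may also be
// starved of CPU during that window.
public class SchedulingGapWatchdog {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 1000;       // how often we expect to run
        final long alertThresholdMs = 500;  // extra delay that counts as a gap
        long last = System.nanoTime();
        while (true) {
            Thread.sleep(intervalMs);
            long now = System.nanoTime();
            long elapsedMs = (now - last) / 1_000_000;
            long gapMs = elapsedMs - intervalMs;
            if (gapMs > alertThresholdMs) {
                System.out.printf("%tFT%<tT scheduling gap of ~%d ms detected%n",
                        System.currentTimeMillis(), gapMs);
            }
            last = now;
        }
    }
}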
Here is an example of one of these events:
Logs on the Broker:
[2018-08-25 04:10:19,695] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:21,697] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:23,700] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:25,701] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:27,702] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:29,704] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:31,707] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:33,709] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:35,712] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:37,714] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:39,716] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:41,719] DEBUG Got ping response for sessionid: 0x36202ab4337002c after 1ms (org.apache.zookeeper.ClientCnxn)
...
[2018-08-25 04:10:53,752] WARN Client session timed out, have not heard from server in 12032ms for sessionid 0x36202ab4337002c (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:53,754] INFO Client session timed out, have not heard from server in 12032ms for sessionid 0x36202ab4337002c, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-08-25 04:10:53,920] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2018-08-25 04:10:53,920] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
...
GC logs during the same time (demonstrating this is not just a long GC):
2018-08-25T04:10:36.434+0000: 35150.779: [GC (Allocation Failure) 3074119K->2529089K(6223360K), 0.0137342 secs]
2018-08-25T04:10:37.367+0000: 35151.713: [GC (Allocation Failure) 3074433K->2528524K(6223360K), 0.0127938 secs]
2018-08-25T04:10:38.274+0000: 35152.620: [GC (Allocation Failure) 3073868K->2528357K(6223360K), 0.0131040 secs]
2018-08-25T04:10:39.220+0000: 35153.566: [GC (Allocation Failure) 3073701K->2528885K(6223360K), 0.0133247 secs]
2018-08-25T04:10:40.175+0000: 35154.520: [GC (Allocation Failure) 3074229K->2528639K(6223360K), 0.0127870 secs]
2018-08-25T04:10:41.084+0000: 35155.429: [GC (Allocation Failure) 3073983K->2530769K(6223360K), 0.0135058 secs]
2018-08-25T04:10:42.012+0000: 35156.358: [GC (Allocation Failure) 3076113K->2531772K(6223360K), 0.0165919 secs]
2018-08-25T04:10:53.737+0000: 35168.083: [GC (Allocation Failure) 3077116K->2529630K(6223360K), 0.0135440 secs]
2018-08-25T04:10:58.515+0000: 35172.860: [GC (Allocation Failure) 3074974K->2531144K(6223360K), 0.0164601 secs]
2018-08-25T04:11:03.149+0000: 35177.495: [GC (Allocation Failure) 3076488K->2531751K(6223360K), 0.0146978 secs]
The attached screenshot shows a spike in write time to our data volume at the same second. Note that this is an average value, so it does not represent the maximum write time, only a sample of write times.