Details
-
Bug
-
Status: Open
-
Critical
-
Resolution: Unresolved
-
1.7.2, 2.4.17
-
None
-
None
Description
Observed this in production environment running some version of 1.7 release.
Application didn't had the right keytab setup for authentication. Application was trying to create HConnection and zookeeper server threw AUTH_FAILED exception.
After few hours of application in this state, saw thousands of zk-event-processor thread with below stack trace.
"zk-event-processor-pool1-t1" #1275 daemon prio=5 os_prio=0 cpu=1.04ms elapsed=41794.58s tid=0x00007fd7805066d0 nid=0x1245 waiting on condition [0x00007fd75df01000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@11.0.18.0.102/Native Method) - parking to wait for <0x00007fd9874a85e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(java.base@11.0.18.0.102/LockSupport.java:194) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.18.0.102/AbstractQueuedSynchronizer.java:2081) at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.18.0.102/LinkedBlockingQueue.java:433) at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1054) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1114) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18.0.102/ThreadPoolExecutor.java:628)
HConnectionImplementation(Configuration conf, boolean managed, ExecutorService pool, User user, String clusterId) throws IOException { ... ... try { this.registry = setupRegistry(); retrieveClusterId(); ... ... } catch (Throwable e) { // avoid leaks: registry, rpcClient, ... LOG.debug("connection construction failed", e); close(); throw e; }
retrieveClusterId internally calls ZKConnectionRegistry#getClusterId
private String clusterId = null; @Override public String getClusterId() { if (this.clusterId != null) return this.clusterId; // No synchronized here, worse case we will retrieve it twice, that's // not an issue. try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) { this.clusterId = ZKClusterId.readClusterIdZNode(zkw); if (this.clusterId == null) { LOG.info("ClusterId read in ZooKeeper is null"); } } catch (KeeperException | IOException e) { ---> WE ARE SWALLOWING THIS EXCEPTION AND RETURNING NULL. LOG.warn("Can't retrieve clusterId from Zookeeper", e); } return this.clusterId; }
ZkConnectionRegistry#getClusterId threw the following exception.(Our logging system trims stack traces longer than 5 lines.)
Cause: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid StackTrace: org.apache.zookeeper.KeeperException.create(KeeperException.java:126) org.apache.zookeeper.KeeperException.create(KeeperException.java:54) org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1213) org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:285) org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:470)
We should throw KeeperException from ZKConnectionRegistry#getClusterId all the way back to HConnectionImplementation constructor to close all the watcher threads and throw the exception back to the caller.
Attachments
Issue Links
- links to