Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Cannot Reproduce
Affects Version: 1.7.2
Problem only exists when Kerberos is turned on.
Description
If an Accumulo client tries to send an RPC to a tserver but the client's token is expired, it will get stuck in an infinite loop here.
I'm setting the priority to "minor" because it's actually pretty difficult to put the system into this state: you have to create the client with a valid token, let the token expire, and then try to use the client. We hit this by accident in the cleanup phase of a very long-running MR job; the workaround (a.k.a. the right way to do it) is to create a new client instead of reusing an old one.
On the tserver side, we get an exception like this every 100ms:
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
On the client side, no output is produced unless debug logging is turned on for o.a.a.core.client.impl.ServerClient, in which case you see a bunch of "Failed to find TGT" errors.
I'm not sure about the best way to fix it, and advice is welcome. I'm thinking that a binary exponential backoff (maybe capped at 30s?) instead of a retry every 100ms would at least lighten the load on the tservers.
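
To make the backoff idea concrete, here is a minimal sketch in plain Java. It is not Accumulo code: the class RetryBackoff, the methods delayMs and callWithBackoff, and the maxAttempts limit are all hypothetical names introduced for illustration. The 100ms base and the 30s cap come from the numbers above. Note that backoff only reduces the retry load on the tservers; a client holding an expired token would still fail on every attempt, which is why the sketch also assumes a bounded number of attempts rather than looping forever.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Accumulo. */
final class RetryBackoff {
    static final long BASE_DELAY_MS = 100;    // the current fixed retry interval
    static final long MAX_DELAY_MS = 30_000;  // the proposed 30s cap

    /** Delay before retry number 'attempt' (0-based): base * 2^attempt, capped. */
    public static long delayMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // 100ms << 20 is already far above the cap, so return early
        // instead of shifting further and risking overflow.
        if (attempt >= 20) {
            return MAX_DELAY_MS;
        }
        return Math.min(BASE_DELAY_MS << attempt, MAX_DELAY_MS);
    }

    /**
     * Calls 'rpc' until it succeeds or 'maxAttempts' calls have failed,
     * sleeping with capped exponential backoff between failures.
     * The attempt limit is an assumption of this sketch.
     */
    public static <T> T callWithBackoff(Callable<T> rpc, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return rpc.call();
            } catch (Exception e) {
                last = e;
                // Do not sleep after the final failed attempt.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMs(attempt));
                }
            }
        }
        throw last; // surface the last failure instead of retrying forever
    }
}
```

With these constants the delays run 100ms, 200ms, 400ms and so on, reaching the 30s cap at the tenth retry. Surfacing the last exception after a bounded number of attempts would also give the caller a chance to notice the expired token and create a new client.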