[ACCUMULO-4359] Accumulo client stuck in infinite loop when Kerberos ticket expires - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Cannot Reproduce
Affects Version/s: 1.7.2
Fix Version/s: None
Component/s: core
Labels: None
Environment: Problem only exists when Kerberos is turned on.

Description

If an Accumulo client tries to send an RPC to a tserver but the client's token is expired, it will get stuck in an infinite loop here.

I'm setting the priority to "minor" because it's actually pretty difficult to put the system into this state: you have to create the client with a valid token, let the token expire, and then try to use the client. We hit this by accident in the cleanup phase of a very long-running MR job; the workaround (a.k.a. the right way to do it) is to create a new client instead of re-using an old one.

On the tserver side, we get an exception like this every 100ms:

java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
	at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:360)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
	at java.lang.Thread.run(Thread.java:745)

On the client side, no output is produced unless debug logging is turned on for o.a.a.core.client.impl.ServerClient, in which case you see a bunch of "Failed to find TGT" errors.

I'm not sure about the best way to fix it, and advice is welcome. I'm thinking that a binary exponential backoff (maybe capped at 30s?) instead of a retry every 100ms would at least lighten the load on the tservers.
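
To make the backoff idea concrete, here is a minimal sketch in plain Java. It is not Accumulo code and not a proposed patch: the class `CappedBackoff` and its methods `delayMs` and `callWithRetry` are hypothetical names made up for illustration. The 100ms starting delay and 30s cap come from the numbers above; the bounded number of attempts is an extra assumption, added so that a client with an expired ticket eventually gets an exception instead of looping forever.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Accumulo. */
public final class CappedBackoff {
    private final long initialMs;
    private final long capMs;

    public CappedBackoff(long initialMs, long capMs) {
        if (initialMs <= 0 || capMs < initialMs) {
            throw new IllegalArgumentException("need 0 < initialMs <= capMs");
        }
        this.initialMs = initialMs;
        this.capMs = capMs;
    }

    /** Delay before retry number {@code attempt} (0-based): initialMs * 2^attempt, capped at capMs. */
    public long delayMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long delay = initialMs;
        for (int i = 0; i < attempt; i++) {
            // Stop doubling once the next step would pass the cap (also avoids overflow).
            if (delay > capMs / 2) {
                return capMs;
            }
            delay *= 2;
        }
        return delay;
    }

    /**
     * Runs {@code op}, sleeping with capped exponential backoff between failures.
     * Gives up after {@code maxAttempts} calls and rethrows the last failure.
     */
    public <T> T callWithRetry(Callable<T> op, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                throw e; // do not retry if the thread was interrupted
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMs(attempt));
                }
            }
        }
        throw last;
    }
}
```

With `new CappedBackoff(100, 30_000)`, the delays run 100ms, 200ms, 400ms, and so on, reaching the 30s cap at the tenth retry. Whether a real fix should retry at all on an authentication failure, rather than fail fast, is a separate question that this sketch does not answer.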

People

Assignee: Russ Weeks

Reporter: Russ Weeks

Votes: 0

Watchers: 3

Dates

Created: 06/Jul/16 16:09

Updated: 11/Jun/19 04:57

Resolved: 11/Jun/19 04:57