Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
Description
While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. The symptom was a BLOCKED state event-handling thread. This block is most likely caused by the dispatcher itself, so instead of making yet another sync hack in the dispatcher, or even DrainDispatcher (however they have a common codebase for serviceStop at the moment), I'm about to put a safety check, which solved my issue.
jstack of a hang (it hangs here)
"Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in Object.wait() [0x000070000997b000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1252) - locked <0x00000007ae967340> (a java.lang.Thread) at java.lang.Thread.join(Thread.java:1326) at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) - locked <0x00000007ae642cc8> (a java.lang.Object) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67) at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933) at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155) - locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) - locked <0x00000007ae53d5f0> (a java.lang.Object) at org.apache.tez.client.LocalClient.close(LocalClient.java:135) at org.apache.tez.client.TezClient.stop(TezClient.java:775) - locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient) at org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
during the investigation I added some log messages, that lead to this solution:
1. there is not a single event in the queue
2. event handling thread is in BLOCKED state
3. AM is already shut down, but test hangs as LocalClient.stop waits to AM services to be stopped
2021-01-29 16:16:44,350 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, ignoring any new events. ... 2021-01-29 16:16:45,354 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED 2021-01-29 16:16:46,358 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED 2021-01-29 16:16:47,362 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED ... 2021-01-29 16:16:48,365 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED 2021-01-29 16:16:49,349 INFO [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1002)) - Calling stop for all the services 2021-01-29 16:16:49,349 INFO [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1013)) - Exiting DAGAppMaster..GoodBye! 2021-01-29 16:16:49,370 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED 2021-01-29 16:16:49,370 WARN [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(153)) - Cannot wait for drain events anymore, logging current events if any... 2021-01-29 16:16:49,370 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:logCurrentEvents(186)) - Current events in the queue: 0 2021-01-29 16:16:49,371 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(168)) - interrupting event handling thread, state: BLOCKED 2021-01-29 16:16:49,371 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(170)) - interrupted event handling thread, state: BLOCKED 2021-01-29 16:16:49,371 INFO [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(172)) - joining event handling thread, state: BLOCKED
under normal circumstances the event handling thread is in WAITING state (which I think is because of queue.take)
+ I also included the timeout logic of YARN-3999 here
Attachments
Issue Links
- Discovered while testing
-
TEZ-4236 DAGClientServer is not really needed to be started/used in local mode
- Resolved