Details
Description
In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater.
The problem is that timeout handling has changed in Junit 4.13. See the difference between these two snippets:
4.12
@Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask<Throwable> task = new FutureTask<Throwable>(callable); threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } }
4.13
@Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask<Throwable> task = new FutureTask<Throwable>(callable); ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); try { thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } finally { try { thread.join(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } try { threadGroup.destroy(); <---- This } catch (IllegalThreadStateException e) { // If a thread from the group is still alive, the ThreadGroup cannot be destroyed. // Swallow the exception to keep the same behavior prior to this change. } } }
The change comes from https://github.com/junit-team/junit4/pull/1517.
Unfortunately, destroying the thread group causes an issue because there are all sorts of object caching in the IPC layer. The exception is:
java.lang.IllegalThreadStateException at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) at java.lang.Thread.init(Thread.java:402) at java.lang.Thread.init(Thread.java:349) at java.lang.Thread.<init>(Thread.java:675) at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) at org.apache.hadoop.ipc.Client.call(Client.java:1458) at org.apache.hadoop.ipc.Client.call(Client.java:1405) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy81.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
Both the clientExecutor in org.apache.hadoop.ipc.Client and the client object in ProtobufRpcEngine/ProtobufRpcEngine2 are stored as long as they're needed. But since the backing thread group is destroyed in the previous test, it's no longer possible to create new threads.
A quick workaround is to stop the clients and completely clear the ClientCache in ProtobufRpcEngine before each testcase. I tried this and it solves the problem but it feels hacky. Not sure if there is a better approach.
Attachments
Attachments
Issue Links
- breaks
-
HADOOP-17315 Use shaded guava in ClientCache.java
- Resolved
- is related to
-
HADOOP-17316 Upgrade JUnit to 4.13.1
- Open
- relates to
-
MAPREDUCE-7302 Upgrading to JUnit 4.13 causes testcase TestFetcher.testCorruptedIFile() to fail
- Resolved