Description
When running the Random Walk test with the LongEach.xml module, I hit a failure once the walk reaches the Shard.xml step:
16 19:52:05,146 [randomwalk.Framework] ERROR: Error during random walk
java.lang.Exception: Error running node Shard.xml
    at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:346)
    at org.apache.accumulo.test.randomwalk.Framework.run(Framework.java:59)
    at org.apache.accumulo.test.randomwalk.Framework.main(Framework.java:119)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.accumulo.start.Main$2.run(Main.java:157)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Error running node shard.BulkInsert
    at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:346)
    at org.apache.accumulo.test.randomwalk.Module$1.call(Module.java:283)
    at org.apache.accumulo.test.randomwalk.Module$1.call(Module.java:278)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    ... 1 more
Caused by: java.lang.Exception: Failed to run map/red verify
    at org.apache.accumulo.test.randomwalk.shard.BulkInsert.sort(BulkInsert.java:186)
    at org.apache.accumulo.test.randomwalk.shard.BulkInsert.visit(BulkInsert.java:132)
    ... 9 more
Digging into YARN to see why the MR job became unhappy, I see the following:
Error: java.lang.ClassNotFoundException: org.apache.commons.math.stat.descriptive.SummaryStatistics
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at org.apache.accumulo.core.file.rfile.RFile$Writer.<init>(RFile.java:310)
    at org.apache.accumulo.core.file.rfile.RFileOperations.openWriter(RFileOperations.java:127)
    at org.apache.accumulo.core.file.rfile.RFileOperations.openWriter(RFileOperations.java:106)
    at org.apache.accumulo.core.file.DispatchingFileFactory.openWriter(DispatchingFileFactory.java:78)
    at org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat$1.write(AccumuloFileOutputFormat.java:172)
    at org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat$1.write(AccumuloFileOutputFormat.java:152)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
    at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:150)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
It looks like this commit introduced a runtime dependency on the commons-math JAR (in the RFile.Writer class), but the tests weren't updated to ensure that the same dependency is put onto the classpath of MR jobs submitted by Random Walk.
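As a rough diagnostic for the missing-class symptom described above, the sketch below checks whether a given class can be loaded in the current JVM. The names ClasspathCheck and isLoadable are hypothetical and introduced only for illustration; they are not part of Accumulo or Random Walk. The check only reflects the classpath of the JVM it runs in, so it would need to run with the same classpath as the MR task to say anything about the failing job.

```java
public class ClasspathCheck {

    /**
     * Hypothetical helper: returns true if the named class can be loaded
     * by this class's class loader. The class is not initialized.
     */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class itself or something it links against is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the ClassNotFoundException above.
        String name = args.length > 0
                ? args[0]
                : "org.apache.commons.math.stat.descriptive.SummaryStatistics";
        System.out.println(name
                + (isLoadable(name) ? " is loadable" : " is missing from the classpath"));
    }
}
```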
Props to busbey for helping to figure out the root cause here. On a separate note, we may want to start running this test before releases, as it appears this regression also snuck into 1.8.0 and at least one 1.6 release (though, since I don't have an easy way to test this against a non-1.7.2 cluster, I'm limiting the affects versions to what I've confirmed myself). Ping kturner, who might know the simplest way to fix this.