Details
Description
TestNNBench#testNNBenchCreateReadAndDelete failed couple of times in our internal jenkins build.
java.lang.AssertionError: create_write should create the file at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.hdfs.TestNNBench.testNNBenchCreateReadAndDelete(TestNNBench.java:55)
Below is my analysis for why it didn't create the file.
// Some comments here public void map(Text key, LongWritable value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { if (barrier()) { String fileName = "file_" + value; if (op.equals(OP_CREATE_WRITE)) { startTimeTPmS = System.currentTimeMillis(); doCreateWriteOp(fileName, reporter); } ... } else { output.collect(new Text("l:latemaps"), new Text("1")); } // Below are the relevant parts of barrier() method private boolean barrier() { .. // If the sleep time is greater than 0, then sleep and return ... LOG.info("Waiting in barrier for: " + sleepTime + " ms"); return retVal; } // Below are the relevant parts of the doCreateWriteOp private void doCreateWriteOp(String name, Reporter reporter) { FSDataOutputStream out; byte[] buffer = new byte[bytesToWrite]; for (long l = 0l; l < numberOfFiles; l++) { Path filePath = new Path(new Path(baseDir, dataDirName), name + "_" + l); } .... }
This file BASE_DIR/data/file_0_0 is getting created only if the map task starts before the time mentioned by startTime.
Refer the chunk which I pasted above.
map(..) --> barrier() and only if barrier() evaluates to true it will call doCreateWriteOp which will eventually create the file.
In test case, the delay value is 3 seconds as per "-startTime", "" + (Time.now() / 1000 + 3)
In this failing test case, I can see the task starting minimum 6 seconds after the test case started.
2017-01-27 03:11:15,387 INFO [Thread-4] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(289)) - Submitting tokens for job: job_local1711545156_0001 2017-01-27 03:11:23,405 INFO [Thread-4] mapreduce.Job (Job.java:submit(1345)) - The url to track the job: http://localhost:8080/
Also when I run this test on my laptop, I see the following line being printed.
2017-01-27 17:09:27,982 INFO [LocalJobRunner Map Task Executor #0] hdfs.NNBench (NNBench.java:barrier(676)) - Waiting in barrier for: 1018 ms
This line will be printed only in barrier() method and I don't see this line in the logs of failed test.
In our environment, the jenkins server was very slow and it took more than 6 seconds to launch a map task.
The correct fix in my opinion would be to return true in case there is no sleep in barrier() method. Only in exception, it should return false.
Attachments
Issue Links
- links to