Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Hadoop Flags: Reviewed
Description
There appears to be a race condition inside the JobHistoryServer (JHS). In our build environment, TestMRJobClient.testJobClient() failed with this exception:
java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
    at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
    at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
    at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
Root cause:
1. MapReduce job completes
2. CLI calls cluster.getJob(jobid)
3. The job is finished and the client side gets redirected to JHS
4. The job data is missing from CachedHistoryStorage so JHS tries to find the job
5. First it scans the intermediate directory and finds the job
6. The call moveToDone() is scheduled for execution on a separate thread inside moveToDoneExecutor and it starts to run immediately
7. RPC invocation returns with the path pointing to /tmp/hadoop-yarn/staging/history/done_intermediate
8. The call to moveToDone() completes which moves the contents of done_intermediate to done
9. Hadoop CLI tries to download the config file from done_intermediate but it's no longer there
Usually the move scheduled in step #6 is slow enough that it finishes only after the CLI has downloaded the file (step #9), but sometimes it finishes first, which triggers this race condition.
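The ordering above can be reproduced outside Hadoop with a small local-filesystem sketch. This is not the JHS code: the class name HistoryRaceSketch, the method clientSeesMissingFile(), and the shortened file name are hypothetical, and java.nio.file stands in for HDFS. The sketch waits on the move explicitly to force the unlucky ordering (step 8 before step 9), which in the real server happens only occasionally.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the JHS/CLI interaction; only the ordering is modeled.
public class HistoryRaceSketch {

    // Returns true if the client hits a missing file at the path it was given.
    static boolean clientSeesMissingFile() throws Exception {
        Path root = Files.createTempDirectory("jhs-race");
        Path intermediate = Files.createDirectories(root.resolve("done_intermediate"));
        Path done = Files.createDirectories(root.resolve("done"));
        String name = "job_0001_conf.xml"; // shortened, illustrative file name
        Files.write(intermediate.resolve(name),
                "<configuration/>".getBytes(StandardCharsets.UTF_8));

        ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
        try {
            // Steps 5-6: the scan finds the job and the move is scheduled on another thread.
            Path returnedToClient = intermediate.resolve(name);
            Callable<Path> moveToDone =
                    () -> Files.move(returnedToClient, done.resolve(name));
            Future<Path> move = moveToDoneExecutor.submit(moveToDone);

            // Step 7: the "RPC" has already handed out the done_intermediate path.
            // Step 8: force the unlucky ordering by letting the move finish first.
            move.get();

            // Step 9: the client copies from the stale path.
            try {
                Files.copy(returnedToClient, root.resolve("local_conf.xml"));
                return false; // the file was still in done_intermediate
            } catch (NoSuchFileException e) {
                return true;  // local analogue of the FileNotFoundException above
            }
        } finally {
            moveToDoneExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("stale path missing: " + clientSeesMissingFile());
    }
}
```

Removing the move.get() call restores the nondeterministic behavior described in the ticket: the copy then succeeds or fails depending on which thread gets there first.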
Issue Links
- is related to: MAPREDUCE-7131 Job History Server has race condition where it moves files from intermediate to finished but thinks file is in intermediate (Resolved)
- relates to: MAPREDUCE-7020 Task timeout in uber mode can crash AM (Resolved)