Description
This is the race condition that can occur:
- during the first scanIntermediateDirectory(), HistoryFileInfo.moveToDone() is scheduled for job j1
- during the second scanIntermediateDirectory(), j1 is found again and put in the fileStatusList to process
- HistoryFileInfo.moveToDone() is processed in another thread and history files are moved to the finished directory
- the HistoryFileInfo for j1 is removed from jobListCache
- the j1 entry in fileStatusList is processed, so a new HistoryFileInfo for j1 is created and added to the jobListCache (its history, conf, and summary files point to the intermediate user directory, and its state is IN_INTERMEDIATE)
- moveToDone() is scheduled for this new j1
- moveToDone() fails during moveToDoneNow() for the history file because the source path in the intermediate directory does not exist
From this point on, for as long as the new j1 HistoryFileInfo remains in the jobListCache, the JobHistoryServer will think the history file is still in the intermediate directory. If a user queries this job in the JobHistoryServer UI, they will get:
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not load history file <scheme>://<host>:<port>/mr-history/intermediate/<user>/job_1529348381246_27275711-1535123223269-<user>-<jobname>-1535127026668-1-0-SUCCEEDED-<queue>-1535126980787.jhist
I noticed this issue while running 2.7.4, but the race condition still appears to exist in trunk.
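The interleaving described above can be replayed deterministically in a small single-threaded model. This is only a sketch and not Hadoop code: the class `JhsRaceSketch`, its `Info` type, and the methods `listIntermediate`, `addIfAbsent`, and `replay` are hypothetical stand-ins, the two directories are modeled as in-memory sets, and the "other thread" is simulated by running the move at the point where the race would let it run.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the race; every name here is a hypothetical stand-in,
// not the real HistoryFileManager API.
class JhsRaceSketch {
    // Stand-in for HistoryFileInfo: remembers where it believes the file lives.
    static final class Info {
        final String path;
        Info(String path) { this.path = path; }
    }

    final Set<String> intermediateDir = new HashSet<>();
    final Set<String> doneDir = new HashSet<>();
    final Map<String, Info> jobListCache = new HashMap<>();
    boolean firstMoveOk;
    boolean secondMoveOk;

    // Snapshot of the intermediate directory, as taken at the start of a scan.
    List<String> listIntermediate() { return new ArrayList<>(intermediateDir); }

    // Processing one listed job: create a cache entry if none exists.
    Info addIfAbsent(String job) {
        return jobListCache.computeIfAbsent(job, j -> new Info("intermediate/" + j));
    }

    // Stand-in for moveToDone(): fails if the source file is already gone.
    boolean moveToDone(String job) {
        if (!intermediateDir.remove(job)) {
            return false;
        }
        doneDir.add(job);
        return true;
    }

    static JhsRaceSketch replay() {
        JhsRaceSketch s = new JhsRaceSketch();
        s.intermediateDir.add("j1");

        // Scan 1: j1 is listed and cached; its move is scheduled (deferred here).
        for (String job : s.listIntermediate()) {
            s.addIfAbsent(job);
        }

        // Scan 2: the listing is taken before the move runs, so it still has j1.
        List<String> secondListing = s.listIntermediate();

        // The scheduled move now runs and the cache entry is removed.
        s.firstMoveOk = s.moveToDone("j1");
        s.jobListCache.remove("j1");

        // The stale listing is processed: a new entry pointing at the
        // intermediate directory is created, and its move then fails.
        for (String job : secondListing) {
            s.addIfAbsent(job);
        }
        s.secondMoveOk = s.moveToDone("j1");
        return s;
    }

    public static void main(String[] args) {
        JhsRaceSketch s = replay();
        System.out.println("first move ok: " + s.firstMoveOk);
        System.out.println("second move ok: " + s.secondMoveOk);
        System.out.println("cached path: " + s.jobListCache.get("j1").path);
        System.out.println("file in done dir: " + s.doneDir.contains("j1"));
    }
}
```

In this model the cache ends up holding an entry whose path points at the intermediate directory while the file itself is in the done directory, which corresponds to the "Could not load history file" symptom reported above.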
Issue Links
- relates to MAPREDUCE-7015: Possible race condition in JHS if the job is not loaded (Resolved)