Description
After an RM failover, I found many log lines like the following in the active RM's log file:
2019-01-24 15:43:58,999 WARN org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Cannot get RMApp by appId=application_1542178952162_34746156, just added it to finishedApplications list for cleanup
.....
Looking back through the RM logs, I found that this app had finished hours earlier:
2019-01-23 21:49:55,683 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1542178952162_34746156_000001 State change from FINAL_SAVING to FINISHING
The RM prints "Cannot get RMApp by appId" because of the following sequence:
1. The RM fails over.
2. Each NM reports all of its running apps to the RM in its register request.
3. The running apps are taken from NMContext, and some of them may already have finished.
4. In my cluster, yarn.log-aggregation-enable=false and yarn.nodemanager.log.retain-seconds=86400 (1 day), so an app stays in NMContext for 24 hours after it has finished.
5. My YARN cluster has 7k nodes and runs 50k apps per day, so the NMs report many finished apps to the RM.
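The sequence above can be illustrated with a minimal, self-contained Java sketch. It is not the actual RMNodeImpl code. The class name `RegisterAfterFailoverSketch`, the helper `findUnknownApps`, and the shortened application IDs are all hypothetical, and the sketch assumes the RM simply no longer tracks an app that finished hours ago.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of what happens when an NM re-registers after RM failover.
class RegisterAfterFailoverSketch {

    // Hypothetical helper: returns the apps an NM reported as running
    // that the RM cannot find, logging one WARN line per such app.
    static List<String> findUnknownApps(Set<String> appsKnownToRm,
                                        List<String> appsReportedByNm) {
        List<String> finishedApplications = new ArrayList<>();
        for (String appId : appsReportedByNm) {
            if (!appsKnownToRm.contains(appId)) {
                // Mirrors the message text quoted in this report.
                System.out.println("WARN Cannot get RMApp by appId=" + appId
                    + ", just added it to finishedApplications list for cleanup");
                finishedApplications.add(appId);
            }
        }
        return finishedApplications;
    }

    public static void main(String[] args) {
        // Assumption: the RM only tracks the app that is still running.
        Set<String> appsKnownToRm =
            new HashSet<>(Arrays.asList("application_1_0002"));
        // The NM still holds a finished app in its context and reports both.
        List<String> appsReportedByNm =
            Arrays.asList("application_1_0001", "application_1_0002");

        List<String> cleanup = findUnknownApps(appsKnownToRm, appsReportedByNm);
        System.out.println("Queued for cleanup: " + cleanup);
    }
}
```

In this model, every finished app that a node still retains produces one WARN line per registration, which is why the volume grows with both the number of nodes and the number of apps retained per node.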
Attachments
Issue Links
- is related to
  - YARN-1885 RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts (Closed)
  - YARN-10695 Event related improvement of YARN for better usage. (Open)