Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Description
We have seen various issues where the RM fails to start because bad state causes exceptions on startup.
E.g.: https://issues.apache.org/jira/browse/YARN-2340
Another issue we have seen internally involves the capacity scheduler config:
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
java.lang.IllegalArgumentException: Illegal queue capacity setting, (abs-capacity=0.009548) > (abs-maximum-capacity=0.0095). When label=[]
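The exception message indicates that a queue's absolute capacity was compared against its absolute maximum capacity and found to be larger. A minimal sketch of that kind of check follows. The class and method names are hypothetical and this is not the actual capacity scheduler code; it only illustrates why a stored value of 0.009548 against a maximum of 0.0095 aborts startup.

```java
// Hypothetical sketch; NOT the actual capacity scheduler validation code.
class QueueCapacityCheck {

    // Rejects a queue whose absolute capacity exceeds its absolute maximum capacity.
    public static void checkAbsoluteCapacity(String label, double absCapacity, double absMaxCapacity) {
        if (absCapacity > absMaxCapacity) {
            throw new IllegalArgumentException(
                "Illegal queue capacity setting, (abs-capacity=" + absCapacity
                + ") > (abs-maximum-capacity=" + absMaxCapacity
                + "). When label=[" + label + "]");
        }
    }

    public static void main(String[] args) {
        try {
            // Values taken from the log line above.
            checkAbsoluteCapacity("", 0.009548, 0.0095);
        } catch (IllegalArgumentException e) {
            // When this happens during startup, the process never comes up.
            System.out.println("startup would fail: " + e.getMessage());
        }
    }
}
```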
In such cases, we can't recover until a bug fix is deployed that lets the RM start so that the data can be corrected. And while the RM is forcibly brought up in those cases, it can still serve client / AM requests and further complicate things.
Ideally, we should be able to fix the database independently of the RM being unable to start. But with LevelDB, which is an embedded database, this isn't possible without the RM being up. Separate tools like leveldb-cli aren't always useful because they require additional code to handle specific comparators etc. and need to be deployed together with the RM binaries.
A patch to delete applications from the state store was implemented in https://issues.apache.org/jira/browse/YARN-3410, but that won't work for other bad entries in the state store, like DTs / Master keys / App attempts / CS Conf, from which we can't recover.
Generic DB access would be helpful for deleting / updating invalid keys.
A better solution is to create a safe mode feature in the RM that starts it with only basic functionality so that the bad state can be fixed. The RM will not serve client / AM / NM requests in this mode; it will enable selective admin functionality only (read / write access to the state store).
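The proposed behaviour could be sketched roughly as follows. Everything here is hypothetical: the class, the caller types, the method names, and the example keys are illustrative only and do not exist in YARN, and a sorted in-memory map stands in for the embedded state store. The sketch also assumes that the admin state store operations are available only while safe mode is on.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the proposed safe mode; none of these names exist in YARN.
class SafeModeSketch {

    enum Caller { CLIENT, APPLICATION_MASTER, NODE_MANAGER, ADMIN }

    private final boolean safeMode;
    // Stand-in for the embedded state store: a sorted key/value map.
    private final Map<String, String> stateStore = new TreeMap<>();

    public SafeModeSketch(boolean safeMode, Map<String, String> initialState) {
        this.safeMode = safeMode;
        this.stateStore.putAll(initialState);
    }

    // In safe mode, only admin callers are served.
    public void checkAccess(Caller caller) {
        if (safeMode && caller != Caller.ADMIN) {
            throw new IllegalStateException("RM is in safe mode; rejecting " + caller + " request");
        }
    }

    // Assumption of this sketch: raw state store access is offered only in safe mode.
    private void requireSafeMode() {
        if (!safeMode) {
            throw new IllegalStateException("state store admin access requires safe mode");
        }
    }

    public String adminRead(String key) {
        requireSafeMode();
        return stateStore.get(key);
    }

    public void adminWrite(String key, String value) {
        requireSafeMode();
        stateStore.put(key, value);
    }

    // Returns true if the key existed and was removed.
    public boolean adminDelete(String key) {
        requireSafeMode();
        return stateStore.remove(key) != null;
    }

    public static void main(String[] args) {
        // Hypothetical key for a corrupt entry that would break a normal startup.
        Map<String, String> state = new TreeMap<>();
        state.put("hypothetical/app-attempt/1", "corrupt");
        SafeModeSketch rm = new SafeModeSketch(true, state);

        try {
            rm.checkAccess(Caller.CLIENT);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        rm.checkAccess(Caller.ADMIN);
        System.out.println("deleted bad entry: " + rm.adminDelete("hypothetical/app-attempt/1"));
    }
}
```

Keeping the access check separate from the state store operations means the same gate could sit in front of every client / AM / NM entry point, while the admin path stays small enough to work even when the rest of the recovery logic cannot run.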