Description
Problem:
If the permissions of a subdirectory or file under yarn.nodemanager.local-dirs or yarn.nodemanager.log-dirs change so that the node manager can no longer access it, the node is not marked unhealthy, yet containers launched on that node fail.
Testing:
- run an example PI job in a cluster
- make the user's cache directory unreadable by the node manager. For example:
chmod 222 ./usercache/{user}
- the cluster state stays healthy (the affected node is not marked unhealthy)
- re-run the PI job
- containers fail on the affected node with the error:
... Not able to initialize app-cache directories in any of the configured local directories for user ...
Solution:
Add an extra validation step to DirectoryCollection#testDirs that checks whether the contents of the local-dirs and log-dirs are accessible by the node manager, and marks the node unhealthy if they are not.
A new flag will be introduced to enable this validation: yarn.nodemanager.working-dir-content-accessibility-validation.enabled (default: true)
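
Sketch:
The following is a rough illustration of the kind of content check described above, not the actual patch. It uses only java.nio.file; the class name DirContentAccessCheck and the method findInaccessible are hypothetical, and the real change would live inside DirectoryCollection and honor the new flag. It assumes "accessible" means readable for files, and readable plus traversable for directories.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only (not part of Hadoop).
final class DirContentAccessCheck {

    private DirContentAccessCheck() {
    }

    /**
     * Walks root recursively and returns every path the current process
     * cannot access. An empty list means the whole tree is accessible.
     */
    static List<Path> findInaccessible(Path root) {
        List<Path> inaccessible = new ArrayList<>();
        collect(root, inaccessible);
        return inaccessible;
    }

    private static void collect(Path path, List<Path> inaccessible) {
        // Assumption: symbolic links are skipped rather than followed.
        if (Files.isSymbolicLink(path)) {
            return;
        }
        if (Files.isDirectory(path)) {
            // A directory must be readable (to list it) and executable (to enter it).
            if (!Files.isReadable(path) || !Files.isExecutable(path)) {
                inaccessible.add(path);
                return;
            }
            try (Stream<Path> children = Files.list(path)) {
                children.forEach(child -> collect(child, inaccessible));
            } catch (IOException | UncheckedIOException e) {
                // Listing failed despite the permission check; treat as inaccessible.
                inaccessible.add(path);
            }
        } else if (!Files.isReadable(path)) {
            // Covers unreadable files as well as paths that do not exist.
            inaccessible.add(path);
        }
    }
}
```

With a check like this, a directory such as ./usercache/{user} after chmod 222 would appear in the returned list (when the process is not running as root), and the caller could treat the enclosing local dir as failed.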