Details
- Type: Sub-task
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
- None
- None
Description
The disk health checker verifies a disk by executing mkdir and rmdir periodically.
If these operations do not return within a reasonable timeout, the disk should be marked bad, and nodeInfo.nodeHealthy should therefore flip to false.
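As a minimal sketch of the expected behavior, the mkdir/rmdir probe could be bounded by a timeout roughly as follows. TimedDiskProbe, runWithTimeout, and mkdirRmdir are hypothetical names used only for illustration; this is not YARN's actual disk checker code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of YARN): bounds a disk probe with a timeout. */
class TimedDiskProbe {

    /** Runs the probe on a separate thread; returns false on failure or timeout. */
    static boolean runWithTimeout(Callable<Boolean> probe, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-probe");
            t.setDaemon(true); // a probe stuck in I/O must not block JVM exit
            return t;
        });
        Future<Boolean> result = executor.submit(probe);
        try {
            return Boolean.TRUE.equals(result.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            result.cancel(true); // best effort; blocked disk I/O may ignore interrupts
            return false;        // treat a hung probe as a bad disk
        } catch (ExecutionException e) {
            return false;        // the probe itself failed (e.g. IOException)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    /** The probe itself: create a directory under parent, then remove it. */
    static boolean mkdirRmdir(Path parent) throws IOException {
        Path probeDir = Files.createTempDirectory(parent, "probe");
        Files.delete(probeDir);
        return true;
    }
}
```

A caller would mark the disk bad whenever runWithTimeout(() -> TimedDiskProbe.mkdirRmdir(dir), timeoutMs) returns false. Note that a thread blocked in a hung filesystem call may never terminate, so this sketch only unblocks the caller; it does not reclaim the stuck thread.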
Using Earthquake, our fault injector for distributed systems, I confirmed that current YARN has no such implicit timeout (tested on JDK7, Linux 4.2, ext4).
(I will post the reproduction script shortly.)
I think we can fix this issue by making NodeHealthCheckerService.isHealthy() return false if the value of this.getLastHealthReportTime() is too old.
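A rough sketch of that idea, assuming a configurable maximum report age: the class name StaleAwareHealthChecker, the maxReportAgeMs parameter, the injected clock, and recordReport are all hypothetical and exist only to make the example self-contained. Only isHealthy() and getLastHealthReportTime() mirror the method names mentioned above.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch: a health flag that expires when reports stop arriving. */
class StaleAwareHealthChecker {
    private final long maxReportAgeMs;   // assumed threshold, not an existing YARN setting
    private final LongSupplier clock;    // injected so the staleness logic is testable
    private volatile long lastHealthReportTime;
    private volatile boolean lastReportedHealthy = true;

    StaleAwareHealthChecker(long maxReportAgeMs, LongSupplier clock) {
        this.maxReportAgeMs = maxReportAgeMs;
        this.clock = clock;
        this.lastHealthReportTime = clock.getAsLong();
    }

    /** Called by the periodic disk check each time it completes. */
    void recordReport(boolean healthy) {
        lastReportedHealthy = healthy;
        lastHealthReportTime = clock.getAsLong();
    }

    long getLastHealthReportTime() {
        return lastHealthReportTime;
    }

    /** Healthy only if the last report was good AND it is recent enough. */
    boolean isHealthy() {
        long age = clock.getAsLong() - getLastHealthReportTime();
        return lastReportedHealthy && age <= maxReportAgeMs;
    }
}
```

With this shape, a disk check that hangs in mkdir or rmdir stops calling recordReport, the last report ages past the threshold, and isHealthy() starts returning false without the checker thread having to return.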
Attachments
Issue Links
- is related to YARN-4426 unhealthy disk makes NM LOST (Resolved)