Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
1.4.0
Description
The Ratis WriteLog retry count is reported as "0/0", which means the WriteLog is never retried; instead, the datanode triggers a pipeline failure and closes the pipeline. During periods of high IO this can cause datanodes to send a large number of pipeline close events. Our cluster hit this issue and ended up with pipeline thrashing (pipelines were continuously closed and re-created).
The cause is an ordering problem: nodeFailureTimeoutMs is initialized only after newRaftProperties and setStateMachineDataConfigurations have run, so those methods execute before the timeout has its configured value.
The initialization order needs to be fixed so that syncTimeoutRetry is calculated correctly (30 retries by default).
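
The ordering problem described above can be illustrated with a minimal, self-contained sketch. This is not the actual Ozone or Ratis code: the class names, the computeSyncTimeoutRetry helper, the division used to derive the retry count, and the 300 s / 10 s timeout values are all assumptions chosen so that the correct result matches the default of 30 mentioned above. The sketch only shows how reading a field before it is assigned yields a retry count of 0.

```java
import java.util.concurrent.TimeUnit;

public class InitOrderSketch {
    // Assumed values for illustration only; not taken from the real configuration.
    static final long SYNC_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(10);
    static final long CONFIGURED_NODE_FAILURE_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(300);

    /** Buggy ordering: the retry count is derived before the timeout field is assigned. */
    static class BuggyServer {
        private long nodeFailureTimeoutMs; // Java default value is 0 until assigned
        final int syncTimeoutRetry;

        BuggyServer() {
            // Stands in for newRaftProperties / setStateMachineDataConfigurations
            // running first: the field still holds 0 at this point.
            this.syncTimeoutRetry = computeSyncTimeoutRetry();
            this.nodeFailureTimeoutMs = CONFIGURED_NODE_FAILURE_TIMEOUT_MS;
        }

        // Hypothetical helper: retries = node failure timeout / sync timeout.
        private int computeSyncTimeoutRetry() {
            return (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS);
        }
    }

    /** Fixed ordering: the timeout field is assigned before anything reads it. */
    static class FixedServer {
        private long nodeFailureTimeoutMs;
        final int syncTimeoutRetry;

        FixedServer() {
            this.nodeFailureTimeoutMs = CONFIGURED_NODE_FAILURE_TIMEOUT_MS;
            this.syncTimeoutRetry = computeSyncTimeoutRetry();
        }

        private int computeSyncTimeoutRetry() {
            return (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS);
        }
    }

    public static void main(String[] args) {
        System.out.println("buggy retry = " + new BuggyServer().syncTimeoutRetry); // prints 0
        System.out.println("fixed retry = " + new FixedServer().syncTimeoutRetry); // prints 30
    }
}
```

The two classes differ only in the order of the two constructor statements, which is the kind of change the fix calls for.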