Apache Ozone / HDDS-10717

nodeFailureTimeoutMs should be initialized before syncTimeoutRetry

Details

    Description

      The Ratis WriteLog retry count was found to be "0/0", which means a WriteLog is never retried; instead, the datanode triggers a pipeline failure and the pipeline is closed. Under heavy IO, this can cause datanodes to send a large number of pipeline close events. Our cluster hit this issue, which resulted in pipeline thrashing (pipelines were continuously closed and recreated).

      The cause is that nodeFailureTimeoutMs is initialized after newRaftProperties and setStateMachineDataConfigurations are called, so syncTimeoutRetry is calculated before nodeFailureTimeoutMs has a value.

      The initialization order needs to be fixed so that syncTimeoutRetry is calculated correctly (30 retries by default).
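
      The ordering problem can be illustrated with a minimal, self-contained Java sketch. It is not the actual Ozone code: the class names, the computeSyncTimeoutRetry helper, and the timeout values (a 300 s node failure timeout and a 10 s sync timeout, chosen so the quotient matches the default of 30 mentioned above) are assumptions made for illustration. It only shows how a field read before its assignment yields a retry count of 0.

```java
import java.util.concurrent.TimeUnit;

public class InitOrderSketch {
    // Assumed values for illustration only; not taken from the Ozone configuration.
    static final long NODE_FAILURE_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(300);
    static final long SYNC_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(10);

    /** Hypothetical server that derives the retry count before the timeout is set. */
    static class BuggyServer {
        private long nodeFailureTimeoutMs; // a long field defaults to 0 until assigned
        final int syncTimeoutRetry;

        BuggyServer() {
            // Wrong order: the field is still 0 here, so the quotient is 0.
            this.syncTimeoutRetry = computeSyncTimeoutRetry();
            this.nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
        }

        int computeSyncTimeoutRetry() {
            return (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS);
        }
    }

    /** Same hypothetical server with the assignment moved before the calculation. */
    static class FixedServer {
        private long nodeFailureTimeoutMs;
        final int syncTimeoutRetry;

        FixedServer() {
            // Correct order: set the timeout first, then derive the retry count.
            this.nodeFailureTimeoutMs = NODE_FAILURE_TIMEOUT_MS;
            this.syncTimeoutRetry = computeSyncTimeoutRetry();
        }

        int computeSyncTimeoutRetry() {
            return (int) (nodeFailureTimeoutMs / SYNC_TIMEOUT_MS);
        }
    }

    public static void main(String[] args) {
        System.out.println("buggy syncTimeoutRetry = " + new BuggyServer().syncTimeoutRetry); // 0
        System.out.println("fixed syncTimeoutRetry = " + new FixedServer().syncTimeoutRetry); // 30
    }
}
```

      Under these assumptions the buggy ordering yields 0 retries and the fixed ordering yields 30. Java does not flag the early read because an instance field silently holds its default value until it is assigned.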

            People

              ivanandika Ivan Andika