Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-8982

Prevent infinite loop in getContainer which causes log flooded by WritableRatisContainerProvider if pipeline's nodes are not found

    XMLWordPrintableJSON

Details

    Description

      HDDS-8981 revealed an infinite loop.

      Steps (based on broken test case):

      1. Stop all datanodes
      2. Restart SCM
      3. Force SCM exit safe mode
      4. Allocate block

      WritableRatisContainerProvider is stuck in an infinite loop:

      [IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c not found.
      [IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode d37619cb-6c2f-4ca2-895f-33a0c8ee91be not found.
      [IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode 65bcd90d-da66-4520-87f5-dc49e35f7dc8 not found.
      [IPC Server handler 0 on default port 15001] WARN  container.ContainerManagerImpl (ContainerManagerImpl.java:getMatchingContainer(342)) - Container allocation failed on pipeline=Pipeline[ Id: b7134617-f12a-460c-9ef7-36bb6f2810c6, Nodes: b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c(localhost/127.0.0.1)d37619cb-6c2f-4ca2-895f-33a0c8ee91be(localhost/127.0.0.1)65bcd90d-da66-4520-87f5-dc49e35f7dc8(localhost/127.0.0.1), ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:65bcd90d-da66-4520-87f5-dc49e35f7dc8, CreationTimestamp2023-07-06T11:51:52.213+02:00[Europe/Vienna]]
      java.lang.IllegalArgumentException
      	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
      	at org.apache.hadoop.hdds.scm.node.SCMNodeManager.minHealthyVolumeNum(SCMNodeManager.java:1118)
      	at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.minHealthyVolumeNum(PipelineManagerImpl.java:648)
      	at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getOpenContainerCountPerPipeline(ContainerManagerImpl.java:349)
      	at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getMatchingContainer(ContainerManagerImpl.java:329)
      	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.selectContainer(WritableRatisContainerProvider.java:204)
      	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:101)
      	at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:70)
      	at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:164)
      	at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
      	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:192)
      	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:142)
      	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
      	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:113)
      	at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14238)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
      

      Pipeline is open, datanodes are stopped, thus they do not register after SCM restart.

      minHealthyVolumeNum throws exception, getMatchingContainer catches it and returns null. The loop in WritableRatisContainerProvider exits only after such pipelines are closed (4,5 minutes, 57768149 iterations in this case):

      disabled per-attempt log, log number of attempts on exit from loop
      2023-07-06 12:48:05,378 [IPC Server handler 3 on default port 15001] INFO  ha.SequenceIdGenerator (SequenceIdGenerator.java:getNextId(140)) - Allocate a batch for delTxnId, change lastId from 0 to 1000.
      ...
      2023-07-06 12:52:33,880 [BackgroundPipelineScrubberThread] INFO  pipeline.PipelineManagerImpl (PipelineManagerImpl.java:closePipeline(510)) - Pipeline Pipeline[ Id: 1fcef9a6-655d-485d-ac83-20ec2bcc9225, Nodes: 9ec03363-f1aa-410a-aa90-5cac6833e36c(localhost/127.0.0.1), ReplicationConfig: RATIS/ONE, State:OPEN, leaderId:9ec03363-f1aa-410a-aa90-5cac6833e36c, CreationTimestamp2023-07-06T12:47:33.874+02:00[Europe/Vienna]] moved to CLOSED state
      2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] INFO  pipeline.WritableRatisContainerProvider (WritableRatisContainerProvider.java:getContainer(158)) - Could not find available pipeline of repConfig: RATIS/THREE even after retrying
      2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR pipeline.WritableRatisContainerProvider (WritableRatisContainerProvider.java:getContainer(177)) - Give up after 57768149 attempts, unable to allocate block size=4194304, RATIS/THREE
      2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR block.BlockManagerImpl (BlockManagerImpl.java:allocateBlock(172)) - Unable to allocate a block for the size: 4194304, replicationConfig: RATIS/THREE
      

      Attachments

        Issue Links

          Activity

            People

              adoroszlai Attila Doroszlai
              adoroszlai Attila Doroszlai
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: