Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
HDDS-8981 revealed an infinite loop.
Steps (based on broken test case):
- Stop all datanodes
- Restart SCM
- Force SCM exit safe mode
- Allocate block
WritableRatisContainerProvider is stuck in an infinite loop:
[IPC Server handler 0 on default port 15001] WARN node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c not found. [IPC Server handler 0 on default port 15001] WARN node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode d37619cb-6c2f-4ca2-895f-33a0c8ee91be not found. [IPC Server handler 0 on default port 15001] WARN node.SCMNodeManager (SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, datanode 65bcd90d-da66-4520-87f5-dc49e35f7dc8 not found. [IPC Server handler 0 on default port 15001] WARN container.ContainerManagerImpl (ContainerManagerImpl.java:getMatchingContainer(342)) - Container allocation failed on pipeline=Pipeline[ Id: b7134617-f12a-460c-9ef7-36bb6f2810c6, Nodes: b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c(localhost/127.0.0.1)d37619cb-6c2f-4ca2-895f-33a0c8ee91be(localhost/127.0.0.1)65bcd90d-da66-4520-87f5-dc49e35f7dc8(localhost/127.0.0.1), ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:65bcd90d-da66-4520-87f5-dc49e35f7dc8, CreationTimestamp2023-07-06T11:51:52.213+02:00[Europe/Vienna]] java.lang.IllegalArgumentException at com.google.common.base.Preconditions.checkArgument(Preconditions.java:129) at org.apache.hadoop.hdds.scm.node.SCMNodeManager.minHealthyVolumeNum(SCMNodeManager.java:1118) at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.minHealthyVolumeNum(PipelineManagerImpl.java:648) at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getOpenContainerCountPerPipeline(ContainerManagerImpl.java:349) at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getMatchingContainer(ContainerManagerImpl.java:329) at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.selectContainer(WritableRatisContainerProvider.java:204) at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:101) at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:70) at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:164) at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194) at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:192) at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:142) at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87) at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:113) at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14238) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
Pipeline is open, datanodes are stopped, thus they do not register after SCM restart.
minHealthyVolumeNum throws exception, getMatchingContainer catches it and returns null. The loop in WritableRatisContainerProvider exits only after such pipelines are closed (4,5 minutes, 57768149 iterations in this case):
disabled per-attempt log, log number of attempts on exit from loop
2023-07-06 12:48:05,378 [IPC Server handler 3 on default port 15001] INFO ha.SequenceIdGenerator (SequenceIdGenerator.java:getNextId(140)) - Allocate a batch for delTxnId, change lastId from 0 to 1000. ... 2023-07-06 12:52:33,880 [BackgroundPipelineScrubberThread] INFO pipeline.PipelineManagerImpl (PipelineManagerImpl.java:closePipeline(510)) - Pipeline Pipeline[ Id: 1fcef9a6-655d-485d-ac83-20ec2bcc9225, Nodes: 9ec03363-f1aa-410a-aa90-5cac6833e36c(localhost/127.0.0.1), ReplicationConfig: RATIS/ONE, State:OPEN, leaderId:9ec03363-f1aa-410a-aa90-5cac6833e36c, CreationTimestamp2023-07-06T12:47:33.874+02:00[Europe/Vienna]] moved to CLOSED state 2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] INFO pipeline.WritableRatisContainerProvider (WritableRatisContainerProvider.java:getContainer(158)) - Could not find available pipeline of repConfig: RATIS/THREE even after retrying 2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR pipeline.WritableRatisContainerProvider (WritableRatisContainerProvider.java:getContainer(177)) - Give up after 57768149 attempts, unable to allocate block size=4194304, RATIS/THREE 2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR block.BlockManagerImpl (BlockManagerImpl.java:allocateBlock(172)) - Unable to allocate a block for the size: 4194304, replicationConfig: RATIS/THREE
Attachments
Issue Links
- causes
-
HDDS-10070 Intermittent failure in TestWritableRatisContainerProvider
- Resolved
- is related to
-
HDDS-8981 TestRootedOzoneFileSystem runs out of disk space
- Resolved
- links to