Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
YCSB fails after roughly 4 days (the command was triggered around 2024-03-12 17:49:28 and failed around 2024-03-16 07:43:39) with [INSERT-FAILED]; the underlying issue is a BLOCK_TOKEN_VERIFICATION_FAILED error.
Error on the client side:
2024-03-16 07:43:29,233|INFO|Thread-36|machine.py:205 - run()||GUID=16aa1f97-ce35-4680-82a3-b8ac527277b2|2024-03-16 00:43:29:224 309240 sec: 7052507 operations; 8.9 current ops/sec; est completion in 16 days 17 hours [INSERT: Count=89, Max=2588671, Min=17104, Avg=404562.07, 90=2418687, 99=2545663, 99.9=2588671, 99.99=2588671]
2024-03-16 07:43:33,361|INFO|Thread-36|machine.py:205 - run()||GUID=16aa1f97-ce35-4680-82a3-b8ac527277b2|Error inserting, not retrying any more. number of attempts: 1Insertion Retry Limit: 0
2024-03-16 07:43:39,245|INFO|Thread-36|machine.py:205 - run()||GUID=16aa1f97-ce35-4680-82a3-b8ac527277b2|2024-03-16 00:43:39:224 309250 sec: 7052548 operations; 4.1 current ops/sec; est completion in 16 days 17 hours [CLEANUP: Count=2, Max=268, Min=35, Avg=151.5, 90=268, 99=268, 99.9=268, 99.99=268] [INSERT: Count=41, Max=28426239, Min=24736, Avg=1247846.24, 90=2533375, 99=28426239, 99.9=28426239, 99.99=28426239] [INSERT-FAILED: Count=1, Max=38436863, Min=38404096, Avg=38420480, 90=38436863, 99=38436863, 99.9=38436863, 99.99=38436863]
2024-03-16 07:46:22,355|INFO|Thread-37|test_lrt_hbase_ozone.py:165 - open_files_helper()|Sleeping for 5 minutes.
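As a quick sanity check on the "fails after roughly 4 days" claim, the elapsed time can be recomputed from the timestamps quoted above. A minimal sketch; the only inputs are the trigger time from the description and the INSERT-FAILED time from the client log:

from datetime import datetime

# Timestamps taken from the description and the YCSB client log above (QE pod clock, UTC).
triggered = datetime(2024, 3, 12, 17, 49, 28)   # command triggered
failed    = datetime(2024, 3, 16, 7, 43, 39)    # INSERT-FAILED logged

elapsed = failed - triggered
print(elapsed)                  # 3 days, 13:54:11
print(elapsed.total_seconds())  # ~309251 s, consistent with the "309250 sec" runtime in the last log line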
Checked the HBase Master logs for the same time window (the cluster is in the PDT time zone and the QE pod is in UTC; a small timezone-conversion sketch follows the log excerpt below):
2024-03-16 00:40:53,855 ERROR org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command ReadChunk on the pipeline Pipeline[ Id: 781daaaf-6183-45fd-9f1b-f0be538247be, Nodes: 0da8ff95-6d92-4956-83db-dee85e97488e(vc0121.example.com/10.17.207.31)a19eff5a-6feb-417b-b433-62a5f5af80bc(vc0127.example.com/10.17.207.37)b9d66e9e-8c2b-49ff-bf07-3fd3ea146a7f(vc0123.example.com/10.17.207.33), ReplicationConfig: STANDALONE/THREE, State:OPEN, leaderId:0da8ff95-6d92-4956-83db-dee85e97488e, CreationTimestamp2024-03-12T06:12:35.741-07:00[America/Los_Angeles]].
2024-03-16 00:40:53,855 WARN org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls: Failed to read chunk 113750153625707608_chunk_0 (len=12266) conID: 3060 locID: 113750153625707608 bcsId: 2782589 from 0da8ff95-6d92-4956-83db-dee85e97488e(vc0121.example.com/10.17.207.31); will try another datanode.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: BLOCK_TOKEN_VERIFICATION_FAILED for null: Expired token for user: hbase (auth:SIMPLE)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:675)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.lambda$createValidators$4(ContainerProtocolCalls.java:686)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:400)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$0(XceiverClientGrpc.java:340)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:159)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:335)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:316)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.readChunk(ContainerProtocolCalls.java:358)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.lambda$readChunk$2(ContainerProtocolCalls.java:345)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.tryEachDatanode(ContainerProtocolCalls.java:147)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.readChunk(ContainerProtocolCalls.java:344)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunk(ChunkInputStream.java:425)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkDataIntoBuffers(ChunkInputStream.java:402)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkFromContainer(ChunkInputStream.java:387)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.prepareRead(ChunkInputStream.java:319)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.read(ChunkInputStream.java:173)
    at org.apache.hadoop.hdds.scm.storage.ByteArrayReader.readFromBlock(ByteArrayReader.java:54)
    at org.apache.hadoop.hdds.scm.storage.BlockInputStream.readWithStrategy(BlockInputStream.java:367)
    at org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:56)
    at org.apache.hadoop.hdds.scm.storage.ByteArrayReader.readFromBlock(ByteArrayReader.java:54)
    at org.apache.hadoop.hdds.scm.storage.MultipartInputStream.readWithStrategy(MultipartInputStream.java:96)
    at org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:56)
    at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:64)
    at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:78)
    at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:97)
    at org.apache.hadoop.hbase.io.util.BlockIOUtils.preadWithExtra(BlockIOUtils.java:233)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1494)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1717)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1528)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1322)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1242)
    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:318)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:664)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:617)
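For correlating the two excerpts: the client log uses the QE pod's UTC clock while the HBase Master log above uses the cluster's PDT clock, so the 00:40 PDT ReadChunk failures and the 07:43 UTC INSERT-FAILED fall in the same window. A minimal conversion sketch using only the standard library; the two timestamps are copied from the logs above:

from datetime import datetime
from zoneinfo import ZoneInfo

# Server-side ReadChunk/BLOCK_TOKEN_VERIFICATION_FAILED error as logged on the cluster (PDT).
server_err = datetime(2024, 3, 16, 0, 40, 53, tzinfo=ZoneInfo("America/Los_Angeles"))
# Client-side "Error inserting" as logged in the QE pod (UTC).
client_err = datetime(2024, 3, 16, 7, 43, 33, tzinfo=ZoneInfo("UTC"))

print(server_err.astimezone(ZoneInfo("UTC")))  # 2024-03-16 07:40:53+00:00
print(client_err - server_err)                 # ~2m40s: the read failures immediately precede the YCSB failure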
The Kerberos ticket was still valid on the host. As mentioned before, the cluster's krb5.conf sets the ticket lifetime to 1d and the renew lifetime to 8d, and our background thread renews the ticket every 8 hours. The original ticket would have been obtained around 03/12/24 10:49 (the renew-until limit of 03/20/24 10:49:28 minus the 8-day renew lifetime), so kinit -R appears to have happened about 12 times (see the sketch after the klist output below):
[root@vc0122 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hbase@EXAMPLE.COM

Valid starting       Expires              Service principal
03/16/24 10:49:28    03/17/24 10:49:28    krbtgt/EXAMPLE.COM@EXAMPLE.COM
        renew until 03/20/24 10:49:28
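To back the "about 12 times" estimate: a rough sketch using only the values quoted from krb5.conf and the klist output above (the exact count depends on when the background renewal thread first fired):

from datetime import datetime, timedelta

renew_until  = datetime(2024, 3, 20, 10, 49, 28)   # "renew until" from klist
original_tgt = renew_until - timedelta(days=8)     # krb5.conf renew lifetime 8d -> ~03/12/24 10:49
latest_start = datetime(2024, 3, 16, 10, 49, 28)   # "Valid starting" from klist

renewals = (latest_start - original_tgt) / timedelta(hours=8)  # background thread runs kinit -R every 8h
print(original_tgt, renewals)  # 2024-03-12 10:49:28  12.0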
cc: weichiu
Attachments
Issue Links
- Testing discovered: HDDS-7593 Supporting HSync and lease recovery (Resolved)