Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
When a DN is offline, reading EC data fails with the following error:
GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|There are insufficient datanodes to read the EC block
Stack Trace:
2023-02-03 14:05:31,610|INFO|MainThread|machine.py:188 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|RUNNING: /opt/cloudera/parcels/CDH/bin/ozone sh key get o3://ozone1/vol-x20w7/enc-buck-3yp31/decom_1675432802 /tmp/Get_file1675433131
2023-02-03 14:05:35,968|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:35 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-xceiverclientmetrics.properties,hadoop-metrics2.properties
2023-02-03 14:05:36,040|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:36 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-02-03 14:05:36,041|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:36 INFO impl.MetricsSystemImpl: XceiverClientMetrics metrics system started
2023-02-03 14:05:36,937|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:36 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: 4b386868-0719-4e2f-bd3b-bda45c921f97, Nodes: 0a7dfbbc-9bd4-482a-81ed-9b213ab2bf63(quasar-tgmmij-1.quasar-tgmmij.root.hwx.site/172.27.204.197), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:36.904Z[Etc/UTC]].
2023-02-03 14:05:36,938|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:36 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=4b386868-0719-4e2f-bd3b-bda45c921f97: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:36,980|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:36 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: 34a3c677-ed98-428f-a0d9-a19f73f93116, Nodes: 0a7dfbbc-9bd4-482a-81ed-9b213ab2bf63(quasar-tgmmij-1.quasar-tgmmij.root.hwx.site/172.27.204.197), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:36.970Z[Etc/UTC]].
2023-02-03 14:05:36,981|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:36 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=34a3c677-ed98-428f-a0d9-a19f73f93116: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,014|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: aee4853d-cc99-43c8-a682-2dc4ad322242, Nodes: 0a7dfbbc-9bd4-482a-81ed-9b213ab2bf63(quasar-tgmmij-1.quasar-tgmmij.root.hwx.site/172.27.204.197), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:37.003Z[Etc/UTC]].
2023-02-03 14:05:37,016|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=aee4853d-cc99-43c8-a682-2dc4ad322242: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,039|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|2023-02-03 14:05:37,034 [main] WARN io.ECBlockInputStreamProxy (ECBlockInputStreamProxy.java:read(180)) - Failing over to reconstruction read due to an error in ECBlockReader. Exception Class: org.apache.hadoop.ozone.client.io.BadDataLocationException , Exception Message: java.io.IOException: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,040|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 WARN io.ECBlockInputStreamProxy: Failing over to reconstruction read due to an error in ECBlockReader. Exception Class: org.apache.hadoop.ozone.client.io.BadDataLocationException , Exception Message: java.io.IOException: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,057|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 WARN erasurecode.ErasureCodeNative: Loading ISA-L failed: Failed to load libisal.so.2 (libisal.so.2: cannot open shared object file: No such file or directory)
2023-02-03 14:05:37,058|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
2023-02-03 14:05:37,185|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: b3a482a5-33c9-40dd-8614-bfc136ec4479, Nodes: 8fea5559-5799-4c17-8d34-17aa6672b87a(quasar-tgmmij-2.quasar-tgmmij.root.hwx.site/172.27.183.130), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:37.163Z[Etc/UTC]].
2023-02-03 14:05:37,187|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=b3a482a5-33c9-40dd-8614-bfc136ec4479: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,229|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: d18620b2-70cb-4f07-95b7-45d69980f100, Nodes: 8fea5559-5799-4c17-8d34-17aa6672b87a(quasar-tgmmij-2.quasar-tgmmij.root.hwx.site/172.27.183.130), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:37.220Z[Etc/UTC]].
2023-02-03 14:05:37,230|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=d18620b2-70cb-4f07-95b7-45d69980f100: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,260|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: 08dd8828-a81e-44ac-8757-aa1b66df2c72, Nodes: 8fea5559-5799-4c17-8d34-17aa6672b87a(quasar-tgmmij-2.quasar-tgmmij.root.hwx.site/172.27.183.130), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:37.250Z[Etc/UTC]].
2023-02-03 14:05:37,261|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=08dd8828-a81e-44ac-8757-aa1b66df2c72: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,282|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|2023-02-03 14:05:37,279 [main] WARN io.ECBlockReconstructedStripeInputStream (ECBlockReconstructedStripeInputStream.java:loadDataBuffersFromStream(590)) - Failed to read from block conID: 5007 locID: 111677748019205007 bcsId: 0 EC index 5. Excluding the block Exception: java.io.IOException Exception Message: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,284|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 WARN io.ECBlockReconstructedStripeInputStream: Failed to read from block conID: 5007 locID: 111677748019205007 bcsId: 0 EC index 5. Excluding the block Exception: java.io.IOException Exception Message: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,331|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: 59920096-eac8-40bd-86c6-4a2fb44edfc7, Nodes: 4e84413f-bf98-4159-914d-5d4eaae5070d(quasar-tgmmij-3.quasar-tgmmij.root.hwx.site/172.27.202.202), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:37.290Z[Etc/UTC]].
2023-02-03 14:05:37,333|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=59920096-eac8-40bd-86c6-4a2fb44edfc7: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,362|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: d8b18f5b-1fbe-4493-b370-08e22eb0e64d, Nodes: 4e84413f-bf98-4159-914d-5d4eaae5070d(quasar-tgmmij-3.quasar-tgmmij.root.hwx.site/172.27.202.202), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:37.351Z[Etc/UTC]].
2023-02-03 14:05:37,364|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=d8b18f5b-1fbe-4493-b370-08e22eb0e64d: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,390|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: 78e3a9ff-df9d-4cbf-a584-b73254e06ce8, Nodes: 4e84413f-bf98-4159-914d-5d4eaae5070d(quasar-tgmmij-3.quasar-tgmmij.root.hwx.site/172.27.202.202), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-02-03T14:05:37.380Z[Etc/UTC]].
2023-02-03 14:05:37,392|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 INFO storage.BlockInputStream: Unable to read information for block conID: 5007 locID: 111677748019205007 bcsId: 0 from pipeline PipelineID=78e3a9ff-df9d-4cbf-a584-b73254e06ce8: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,411|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|2023-02-03 14:05:37,409 [main] WARN io.ECBlockReconstructedStripeInputStream (ECBlockReconstructedStripeInputStream.java:loadDataBuffersFromStream(590)) - Failed to read from block conID: 5007 locID: 111677748019205007 bcsId: 0 EC index 4. Excluding the block Exception: java.io.IOException Exception Message: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,413|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|23/02/03 14:05:37 WARN io.ECBlockReconstructedStripeInputStream: Failed to read from block conID: 5007 locID: 111677748019205007 bcsId: 0 EC index 4. Excluding the block Exception: java.io.IOException Exception Message: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
2023-02-03 14:05:37,442|INFO|MainThread|machine.py:203 - run()||GUID=76425c29-acbc-4b0b-9b39-623c5879f0b7|There are insufficient datanodes to read the EC block
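To see the pattern in a log like this one at a glance, a small helper can group the failed GetBlock calls by host and collect the EC indexes that the reconstruction reader excluded. This is a hypothetical debugging sketch, not part of Ozone: the class and method names are made up, and the regular expressions assume exactly the message formats shown in the log above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical debugging helper, not part of Ozone.
class EcReadLogSummary {
    // Matches "Pipeline[ Id: <uuid>, Nodes: <uuid>(<host>/<ip>)" in XceiverClientGrpc errors.
    private static final Pattern PIPELINE =
            Pattern.compile("Pipeline\\[ Id: ([0-9a-f-]+), Nodes: [0-9a-f-]+\\(([^/]+)/");
    // Matches "EC index <n>. Excluding the block" in ECBlockReconstructedStripeInputStream warnings.
    private static final Pattern EXCLUDED =
            Pattern.compile("EC index (\\d+)\\. Excluding the block");

    /** Host name -> IDs of the pipelines on which GetBlock failed (expects one record per line). */
    static Map<String, List<String>> failedPipelinesByHost(List<String> lines) {
        Map<String, List<String>> byHost = new TreeMap<>();
        for (String line : lines) {
            if (!line.contains("Failed to execute command GetBlock")) {
                continue;
            }
            Matcher m = PIPELINE.matcher(line);
            if (m.find()) {
                byHost.computeIfAbsent(m.group(2), host -> new ArrayList<>()).add(m.group(1));
            }
        }
        return byHost;
    }

    /** Distinct EC replica indexes that the reconstruction reader excluded. */
    static TreeSet<Integer> excludedIndexes(List<String> lines) {
        TreeSet<Integer> indexes = new TreeSet<>();
        for (String line : lines) {
            Matcher m = EXCLUDED.matcher(line);
            if (m.find()) {
                indexes.add(Integer.parseInt(m.group(1)));
            }
        }
        return indexes;
    }

    public static void main(String[] args) throws java.io.IOException {
        if (args.length == 0) {
            System.out.println("usage: EcReadLogSummary <client-log-file>");
            return;
        }
        List<String> lines = java.nio.file.Files.readAllLines(java.nio.file.Path.of(args[0]));
        System.out.println("Failed GetBlock pipelines by host: " + failedPipelinesByHost(lines));
        System.out.println("Excluded EC indexes: " + excludedIndexes(lines));
    }
}
```

With one log record per line, this should report three failed single-node pipelines on each of quasar-tgmmij-1, quasar-tgmmij-2 and quasar-tgmmij-3, and excluded indexes [4, 5]: each unreachable host was retried through three different STANDALONE pipelines before the reader moved on.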
Additional debugging (RCA) found that a sufficient number of DNs were available at the time of the key get operation. Details:
EC DNs: 7 expected, 7 available.
Ratis DNs: 3 required, 3 available.
EC datanodes (7):
hostname-1.hostname.root.hwx.site, hostname-7.hostname.root.hwx.site, hostname-2.hostname.root.hwx.site, hostname-6.hostname.root.hwx.site, hostname-3.hostname.root.hwx.site, hostname-5.hostname.root.hwx.site, hostname-8.hostname.root.hwx.site
Ratis datanodes available at this point (5):
hostname-2.hostname.root.hwx.site, hostname-3.hostname.root.hwx.site, hostname-1.hostname.root.hwx.site, hostname-7.hostname.root.hwx.site, hostname-6.hostname.root.hwx.site
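The RCA above counts datanodes in the cluster, but whether an EC read can proceed depends on how many distinct replica indexes the client can actually read. A minimal model of that condition, assuming Reed-Solomon coding (where any `data` distinct indexes are enough to decode a stripe) and indexes numbered from 1; the class and method names are hypothetical, not Ozone's API:

```java
import java.util.Set;

// Hypothetical model of the read-sufficiency condition; not Ozone's API.
class EcReadSufficiency {
    /**
     * Returns true if enough replica indexes remain readable to decode the stripe.
     *
     * @param data   number of data replicas in the EC scheme (3 for rs-3-2)
     * @param parity number of parity replicas (2 for rs-3-2)
     * @param failed 1-based replica indexes that could not be read from any known location
     */
    static boolean canReconstruct(int data, int parity, Set<Integer> failed) {
        int readable = 0;
        for (int index = 1; index <= data + parity; index++) {
            if (!failed.contains(index)) {
                readable++;
            }
        }
        // With Reed-Solomon, any `data` distinct indexes are enough to decode.
        // The number of healthy DNs elsewhere in the cluster does not enter this check.
        return readable >= data;
    }

    public static void main(String[] args) {
        // Assumed rs-3-2 layout; the issue does not state the bucket's EC replication config.
        System.out.println(canReconstruct(3, 2, Set.of(1, 4, 5))); // false
        System.out.println(canReconstruct(3, 2, Set.of(4, 5)));    // true
    }
}
```

Under the rs-3-2 assumption, losing any three of the five indexes leaves only two readable, so the read fails however many other DNs are healthy, unless the client can find another location for one of the failed indexes.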
The log files are attached.
Issue Links
requires:
- HDDS-7917 EC: ECBlockInputStream should try spare replicas on error (Resolved)
- HDDS-7918 EC: ECBlockReconstructedStripeInputStream should check for spare replicas before failing an index (Resolved)
- HDDS-7919 EC: ECPipelineProvider.createForRead should filter out dead replicas and sort replicas (Resolved)
- HDDS-11261 Get key answer with "There are insufficient datanodes to read the EC block" even nodes amount is sufficient (Resolved)
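The linked issues are known here only by their titles. Read together, they suggest trying every known location of a replica index, skipping dead nodes and in a preferred order, before declaring that index failed. The following is a hypothetical Java 16+ sketch of that idea, not the code from those issues; `Location`, its `dead` and `distance` fields, and the `read` predicate are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch of spare-replica fallback; not the code from HDDS-7917/7918/7919.
class SpareReplicaReader {
    /** One known location of an EC replica. All fields are illustrative assumptions. */
    record Location(int replicaIndex, String host, boolean dead, int distance) {}

    /** Groups locations by replica index, drops dead nodes, and orders the rest closest first. */
    static Map<Integer, List<Location>> candidatesByIndex(List<Location> locations) {
        Map<Integer, List<Location>> byIndex = new LinkedHashMap<>();
        for (Location loc : locations) {
            if (loc.dead()) {
                continue; // never hand a node known to be dead to the reader
            }
            byIndex.computeIfAbsent(loc.replicaIndex(), i -> new ArrayList<>()).add(loc);
        }
        byIndex.values().forEach(list -> list.sort(Comparator.comparingInt(Location::distance)));
        return byIndex;
    }

    /**
     * Tries each candidate for one index in order; `read` stands in for a real block read.
     * An empty result is the only point at which the index should be counted as failed.
     */
    static Optional<Location> readIndex(List<Location> candidates, Predicate<Location> read) {
        for (Location candidate : candidates) {
            if (read.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Counting an index as failed only after `readIndex` has exhausted its candidates keeps a single offline DN from counting against the data-index threshold when a spare copy of the same index exists on another node.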