Description
In a recent investigation, we saw an odd block recovery issue that was difficult to bring to a conclusion because of insufficient logs.
At the most critical point in the sequence of events, the block recovery failed to {{commitBlockSynchronization}} on the NN because the block could not be closed. This left the file open indefinitely (for 1+ months).
The block was not closed on the NN because the cluster is configured with {{dfs.namenode.replication.min}} = 2, but only 1 replica carried the latest genstamp.
We were not able to tell why only 1 replica was on the latest genstamp.
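For reference, here is a minimal sketch of that completion rule; this is illustrative only, not the NameNode's actual code. It shows why, with {{dfs.namenode.replication.min}} = 2 and only one replica on the recovery genstamp, the block stays open.

{code:java}
/**
 * Minimal illustrative sketch of the completion rule described above;
 * not BlockManager's actual code.
 */
public class MinReplicationSketch {

  /** The block can be closed only if >= minReplication replicas carry the latest genstamp. */
  static boolean canCloseBlock(long latestGenStamp, long[] replicaGenStamps, int minReplication) {
    int upToDate = 0;
    for (long gs : replicaGenStamps) {
      if (gs == latestGenStamp) {
        upToDate++;
      }
    }
    return upToDate >= minReplication;
  }

  public static void main(String[] args) {
    // In this incident only one replica carried the recovery genstamp (1107074793),
    // the other stayed on the old genstamp (1106150312), and min replication is 2,
    // so the block could not be closed.
    long[] genStamps = {1107074793L, 1106150312L};
    System.out.println(canCloseBlock(1107074793L, genStamps, 2));  // prints false
  }
}
{code}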
From the primary node of the recovery (ps2204), initReplicaRecoveryImpl was called on each of the 7 DNs the block was ever placed on. All DNs except ps2204 and ps3765 failed the genstamp comparison, which is expected. ps2204 and ps3765 got past the comparison (their logs show no exceptions), yet updateReplicaUnderRecovery appears to have been called only on ps3765.
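For context, here is a simplified sketch of that two-phase flow. The types below ({{DatanodeProxy}}, {{ReplicaInfo}}) are hypothetical stand-ins, not the real BlockRecoveryWorker / InterDatanodeProtocol code: the primary DN first calls initReplicaRecovery on every DN that ever held the block, drops the replicas that fail the genstamp check, and then calls updateReplicaUnderRecovery on the remaining participants before committing to the NN.

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of the two-phase recovery flow; hypothetical stand-in types. */
public class BlockRecoveryFlowSketch {

  interface DatanodeProxy {
    /** Phase 1: the DN checks its replica; throws if the genstamp comparison fails. */
    ReplicaInfo initReplicaRecovery(long blockId, long recoveryId) throws Exception;
    /** Phase 2: the DN bumps the replica to the recovery genstamp and agreed length. */
    void updateReplicaUnderRecovery(long blockId, long recoveryId, long newLength) throws Exception;
  }

  static class ReplicaInfo {
    final DatanodeProxy dn;
    final long numBytes;
    ReplicaInfo(DatanodeProxy dn, long numBytes) {
      this.dn = dn;
      this.numBytes = numBytes;
    }
  }

  static void recover(List<DatanodeProxy> datanodes, long blockId, long recoveryId) {
    List<ReplicaInfo> participants = new ArrayList<>();

    // Phase 1: every DN that ever held the block is asked; stale replicas drop out here.
    for (DatanodeProxy dn : datanodes) {
      try {
        participants.add(dn.initReplicaRecovery(blockId, recoveryId));
      } catch (Exception e) {
        System.out.println("initReplicaRecovery failed on " + dn + ": " + e);
      }
    }

    // Agree on a safe length, e.g. the minimum of the participants' lengths.
    long newLength = Long.MAX_VALUE;
    for (ReplicaInfo r : participants) {
      newLength = Math.min(newLength, r.numBytes);
    }

    // Phase 2: the survivors are moved to the recovery genstamp. In this incident,
    // the call was only observed in ps3765's log, not ps2204's.
    for (ReplicaInfo r : participants) {
      try {
        r.dn.updateReplicaUnderRecovery(blockId, recoveryId, newLength);
      } catch (Exception e) {
        System.out.println("updateReplicaUnderRecovery failed on " + r.dn + ": " + e);
      }
    }

    // The primary DN would then call commitBlockSynchronization on the NN (not shown).
  }
}
{code}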
This jira proposes logging more details when BlockRecoveryWorker is about to call updateReplicaUnderRecovery on the DataNodes, so that cases like this can be diagnosed in the future.
{noformat}
$ grep "updateReplica:" ps2204.dn.log
$ grep "updateReplica:" ps3765.dn.log
hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.933Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"updateReplicaUnderRecovery","level":"INFO","line_number":"2512","thread_name":"IPC Server handler 6 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"updateReplica: BP-550436645-17.142.147.13-1438988035284:blk_2172795728_1106150312, recoveryId=1107074793, length=65024, replica=ReplicaUnderRecovery, blk_2172795728_1106150312, RUR

$ grep "initReplicaRecovery:" ps2204.dn.log
hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaWaitingToBeRecovered, blk_2172795728_1106150312, RWR
hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: changing replica state for blk_2172795728_1106150312 from RWR to RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}}

$ grep "initReplicaRecovery:" ps3765.dn.log
hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW
hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW
hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: changing replica state for blk_2172795728_1106150312 from RBW to RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}}
{noformat}
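To illustrate what is being proposed, here is a minimal sketch of the kind of log line that could be emitted just before the RPC. The class, method, and field names are illustrative only, not an actual patch.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative only: the extra detail proposed before the updateReplicaUnderRecovery RPC. */
class UpdateReplicaLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(UpdateReplicaLoggingSketch.class);

  void logBeforeUpdate(String datanodeId, String block, long recoveryId, long newLength,
      String reportedState) {
    // Record which DN is about to be called, for which block/recoveryId, the target
    // length, and what the DN reported in initReplicaRecovery, so that a missing
    // updateReplicaUnderRecovery call can be traced after the fact.
    LOG.info("Calling updateReplicaUnderRecovery: datanode={}, block={}, recoveryId={},"
        + " newLength={}, reportedState={}",
        datanodeId, block, recoveryId, newLength, reportedState);
  }
}
{code}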
P.S. HDFS-11499 was suspected at one point, but this is inconclusive since we do not have all the NN logs needed to understand the decommission.