Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.3.6
-
None
Description
Problem description
If an erasure-coded file is not large enough to fill the stripe width of the EC policy, the block distribution can be suboptimal.
For example, an RS-6-3-1024K EC file smaller than 1024K will have 1 data block and 3 parity blocks. Although the block placement policy chooses all 9 (6 + 3) storage locations, only 4 of them are used, and the 3 parity blocks always go to the last 3 locations. If the cluster has very few racks (e.g. 3), the current scheme, which orders the chosen nodes to form the shortest-path pipeline, is likely to put those last nodes in the same rack, resulting in a suboptimal rack distribution.
Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
Racks:     R1 R1 R1 R2 R2 R2 R3 R3 R3
Blocks:    D1                P1 P2 P3
We can see that blocks are stored in only 2 racks, not 3.
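The arithmetic behind this example can be checked with a small standalone sketch. It is not Hadoop code: the class and method names (SmallEcFileRacks, usedIndices, distinctRacks) are hypothetical, and it assumes only what is described above, namely that data blocks fill the first locations and parity blocks always occupy the last 3.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only; none of these names exist in Hadoop.
class SmallEcFileRacks {
  /** Indices of the chosen locations that actually receive a block. */
  static List<Integer> usedIndices(long fileLen, long cellSize,
                                   int dataUnits, int parityUnits) {
    // A data block is used only if the file has at least one cell for it.
    long cells = (fileLen + cellSize - 1) / cellSize;
    int dataBlocks = (int) Math.min(dataUnits, cells);
    List<Integer> used = new ArrayList<>();
    for (int i = 0; i < dataBlocks; i++) {
      used.add(i);                 // data blocks fill the first locations
    }
    for (int i = 0; i < parityUnits; i++) {
      used.add(dataUnits + i);     // parity blocks take the last locations
    }
    return used;
  }

  /** Number of distinct racks covered by the used locations. */
  static int distinctRacks(List<Integer> used, String[] racks) {
    Set<String> seen = new HashSet<>();
    for (int i : used) {
      seen.add(racks[i]);
    }
    return seen.size();
  }

  public static void main(String[] args) {
    String[] racks = {"R1", "R1", "R1", "R2", "R2", "R2", "R3", "R3", "R3"};
    // A 512K file under RS-6-3-1024K: 1 data block + 3 parity blocks.
    List<Integer> used = usedIndices(512L * 1024, 1024L * 1024, 6, 3);
    System.out.println(used + " -> " + distinctRacks(used, racks) + " racks");
    // prints "[0, 6, 7, 8] -> 2 racks"
  }
}
```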
Because the block group does not span enough racks, an ErasureCodingWork will later be created to replicate a block to a new rack. However, the current code tries to copy the block to the first node in the chosen locations, regardless of its rack. The copy is therefore not guaranteed to improve the situation, and we constantly see "PendingReconstructionMonitor timed out" messages in the log.
Proposed solution
1. Reorder the chosen locations by rack so that the parity blocks are stored in as many racks as possible.
2. Make ErasureCodingWork try to find a target on a new rack.
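The two ideas can be sketched as follows. This is a minimal illustration, not the attached patch: the RackSpread class, its Node type, and both method names are hypothetical. Round-robin interleaving is just one possible way to reorder by rack, and falling back to the first candidate when no new rack exists is an assumption that mirrors the current behavior described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only; none of these names exist in Hadoop.
class RackSpread {
  /** Hypothetical stand-in for a chosen storage location. */
  static final class Node {
    final String name;
    final String rack;
    Node(String name, String rack) {
      this.name = name;
      this.rack = rack;
    }
  }

  /** Idea 1: reorder the chosen locations round-robin across racks. */
  static List<Node> interleaveByRack(List<Node> chosen) {
    // Group nodes per rack, keeping first-seen rack order.
    Map<String, Deque<Node>> byRack = new LinkedHashMap<>();
    for (Node n : chosen) {
      byRack.computeIfAbsent(n.rack, r -> new ArrayDeque<>()).add(n);
    }
    List<Node> out = new ArrayList<>(chosen.size());
    // Take one node from each non-empty rack per pass.
    while (out.size() < chosen.size()) {
      for (Deque<Node> q : byRack.values()) {
        if (!q.isEmpty()) {
          out.add(q.poll());
        }
      }
    }
    return out;
  }

  /** Idea 2: prefer a target whose rack is not yet used by the block group. */
  static Node pickTargetOnNewRack(List<Node> candidates, Set<String> racksInUse) {
    for (Node n : candidates) {
      if (!racksInUse.contains(n.rack)) {
        return n;
      }
    }
    // Assumed fallback: first candidate, or null if there is none.
    return candidates.isEmpty() ? null : candidates.get(0);
  }

  public static void main(String[] args) {
    String[] racks = {"R1", "R1", "R1", "R2", "R2", "R2", "R3", "R3", "R3"};
    List<Node> chosen = new ArrayList<>();
    for (int i = 0; i < racks.length; i++) {
      chosen.add(new Node("N" + (i + 1), racks[i]));
    }
    for (Node n : interleaveByRack(chosen)) {
      System.out.print(n.name + "(" + n.rack + ") ");
    }
    System.out.println();
  }
}
```

With the 9 locations from the example, the interleaved order is N1 N4 N7 N2 N5 N8 N3 N6 N9, so the last 3 locations (N3, N6, N9) fall on R1, R2 and R3, and the small file's D1, P1, P2, P3 cover all 3 racks.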
Real-world test result
We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18 nodes across 3 racks. After setting the RS-6-3-1024K policy on the HBase data directory, we noticed the following:
1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
/hbase/***: Replica placement policy is violated for ***. Block should be additionally replicated on 2 more rack(s). Total number of racks in the cluster: 3
...
Erasure Coded Block Groups:
...
Unsatisfactory placement block groups: 1475 (2.5252092 %)
2. The NameNode keeps logging "PendingReconstructionMonitor timed out" messages every recheck-interval (5 minutes).
3. The FSNamesystem.UnderReplicatedBlocks metric bumps and clears every recheck-interval.
After applying the patch, all of these problems are gone: "Unsatisfactory placement block groups" is now zero, and there are no more metric bumps or "timed out" logs.