Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 0.13.0
- None
- None
Description
When a new data node joins an HDFS cluster, it holds little data. Any map task assigned to that machine is therefore unlikely to read local data, which increases network bandwidth use. On the other hand, when some data nodes become full, new data blocks are placed only on the non-full data nodes, which reduces read parallelism.
This JIRA aims to find an approach for redistributing data blocks when the cluster becomes imbalanced. A solution should meet the following requirements:
1. It maintains data availability guarantees, in the sense that rebalancing does not reduce the number of replicas that a block has or the number of racks on which the block resides.
2. An administrator should be able to invoke and interrupt rebalancing from the command line.
3. Rebalancing should be throttled so that it neither makes the namenode too busy to serve incoming requests nor saturates the network.
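Requirement 1 can be read as a per-move check: before relocating a replica, confirm that the block ends up with at least as many replicas and at least as many racks as before. The sketch below is only an illustration of that reading, not the implementation adopted for this issue; the `Replica` type, the node and rack names, and the `move_preserves_availability` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    node: str  # hypothetical datanode identifier
    rack: str  # hypothetical rack identifier


def move_preserves_availability(replicas, source, target):
    """Return True if moving one replica of a block from `source` to
    `target` reduces neither its replica count nor its rack count."""
    if source not in replicas:
        return False  # there is no replica at the source to move
    if any(r.node == target.node for r in replicas):
        return False  # target already holds a replica, so a copy would be lost
    after = [r for r in replicas if r != source] + [target]
    racks_before = {r.rack for r in replicas}
    racks_after = {r.rack for r in after}
    return len(after) >= len(replicas) and len(racks_after) >= len(racks_before)


# Assumed example layout: three replicas spread over two racks.
replicas = [
    Replica("dn1", "rackA"),
    Replica("dn2", "rackA"),
    Replica("dn3", "rackB"),
]

# Moving within rackA keeps both counts unchanged.
ok_same_rack = move_preserves_availability(
    replicas, Replica("dn1", "rackA"), Replica("dn5", "rackA"))

# Moving the only rackB replica into rackA would drop a rack.
loses_rack = move_preserves_availability(
    replicas, Replica("dn3", "rackB"), Replica("dn4", "rackA"))
```

Under these assumptions, `ok_same_rack` is `True` and `loses_rack` is `False`. A real rebalancer would also have to account for moves already in flight, which this sketch ignores.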
Attachments
Issue Links
- depends upon
  - HADOOP-1846 DatanodeReport should distinguish live datanodes from dead datanodes (Closed)
  - HADOOP-1912 Datanode should support block replacement (Closed)
  - HADOOP-1914 HDFS should have a NamenodeProtocol to allow secondary namenodes and rebalancing processes to communicate with a primary namenode (Closed)
  - HADOOP-1266 Remove DatanodeDescriptor dependency from NetworkTopology (Closed)
- is depended upon by
  - HBASE-57 [hbase] Master should allocate regions to regionservers based upon data locality and rack awareness (Closed)