Description
Currently, localhost is passed as locality for each block, causing all blocks involved in job to initially target the same node (RM), before being moved by the scheduler (to a rack-local node). This reduces parallelism for jobs (with short-lived mappers).
We should mimic Azures implementation: a config setting fs.s3a.block.location.impersonatedhost where the user can enter the list of hostnames in the cluster to return to getFileBlockLocations.
Possible optimization: for larger systems, it might be better to return N (5?) random hostnames to prevent passing a huge array (the downstream code assumes size = O(3)).
Attachments
Issue Links
- relates to
-
HADOOP-14943 Add common getFileBlockLocations() emulation for object stores, including S3A
- Patch Available