Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
0.2.0
None
None
Description
As each reduce is created, it is given the entire set of potential map names. For my large sort jobs with 64k maps, this means that each reduce task is given a two-dimensional array of 5 tasks/map * 64k maps = 320k strings. Since the reduce task is passed from the job tracker to the task tracker and then down to the task runner, passing the entire list is very expensive. I suspect that this is the cause of the slowdowns that I see in the task trackers' heartbeats when the reduce tasks are being launched.
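To make the size concrete, here is a minimal sketch of the arithmetic, taking "64k" to mean 64 * 1024. The class and method names are hypothetical and exist only for illustration; they are not part of the Hadoop code.

```java
public class MapNamePayload {
    // Number of task-name strings one reduce receives under the current scheme:
    // one row per map, one entry per potential task of that map.
    static long stringsPerReduce(int maps, int tasksPerMap) {
        return (long) maps * tasksPerMap;
    }

    public static void main(String[] args) {
        int maps = 64 * 1024;  // "64k maps" from the description (assumed to mean 65536)
        int tasksPerMap = 5;   // "5 tasks/map" from the description
        System.out.println(stringsPerReduce(maps, tasksPerMap)); // prints 327680
    }
}
```

That is 327,680 strings (the "320k" above) serialized for every reduce task, compared with a single int if only the map count were passed.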
I propose that the ReduceTask be changed to take just the count of maps, with map ids running from 0 to maps - 1.
public ReduceTask(String jobFile, String taskId, int maps, int partition);
Then we need to change the protocol for finding map outputs:
MapOutputLocation[] locateMapOutputs(String jobId, int[] mapIds, int partition);
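A rough sketch of how a reduce might use the proposed call, assuming it asks for every map id in one request. Only the locateMapOutputs signature comes from the proposal; the MapOutputLocation fields, the MapOutputLocator interface, the stub implementation, the job id, and the host names are all hypothetical stand-ins.

```java
public class ReduceTaskSketch {
    // Hypothetical stand-in; the fields of the real MapOutputLocation are not given here.
    static final class MapOutputLocation {
        final int mapId;
        final String host;

        MapOutputLocation(int mapId, String host) {
            this.mapId = mapId;
            this.host = host;
        }
    }

    // Hypothetical interface carrying the proposed method signature.
    interface MapOutputLocator {
        MapOutputLocation[] locateMapOutputs(String jobId, int[] mapIds, int partition);
    }

    // With only the map count available, the reduce derives the ids 0 .. maps - 1 itself.
    static int[] allMapIds(int maps) {
        int[] ids = new int[maps];
        for (int i = 0; i < maps; i++) {
            ids[i] = i;
        }
        return ids;
    }

    public static void main(String[] args) {
        // Stub locator for illustration only: invents a host name for each requested map id.
        MapOutputLocator locator = (jobId, mapIds, partition) -> {
            MapOutputLocation[] out = new MapOutputLocation[mapIds.length];
            for (int i = 0; i < mapIds.length; i++) {
                out[i] = new MapOutputLocation(mapIds[i], "host-" + (mapIds[i] % 4));
            }
            return out;
        };

        int[] ids = allMapIds(4);
        MapOutputLocation[] locations = locator.locateMapOutputs("job_0001", ids, 0);
        System.out.println(locations.length); // prints 4
    }
}
```

The point of the design is that the reduce task carries one int instead of the name array, and map ids are resolved to locations only when the outputs are actually fetched.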