Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version: 0.14.0
Description
In MapTask.MapOutputBuffer.spill(), every key and value is deserialized from the sort buffer into objects and then re-serialized to the spill file with append(key, value):
    DataInputBuffer keyIn = new DataInputBuffer();
    DataInputBuffer valIn = new DataInputBuffer();
    DataOutputBuffer valOut = new DataOutputBuffer();
    while (resultIter.next()) {
      keyIn.reset(resultIter.getKey().getData(), resultIter.getKey().getLength());
      key.readFields(keyIn);
      valOut.reset();
      (resultIter.getValue()).writeUncompressedBytes(valOut);
      valIn.reset(valOut.getData(), valOut.getLength());
      value.readFields(valIn);
      writer.append(key, value);
      reporter.progress();
    }
When you have complex objects, like Nutch's ParseData or Inlinks, this round-trip takes time and creates a lot of garbage.
I've created a patch that seems to work, though I've only tested it on 0.13.0.
It's a bit clumsy, since ValueBytes has to be cast to UncompressedBytes or CompressedBytes inside SequenceFile.Writer.
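The idea can be sketched without Hadoop: the records in the sort buffer are already in serialized form, so copying the bytes straight to the spill file produces the same output as the deserialize/re-serialize loop while skipping the per-record object churn. A minimal, self-contained illustration (the spillViaObjects/spillRaw names are hypothetical, not Hadoop API; a toy 4-byte int record stands in for a Writable):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class SpillSketch {
    // Old path: deserialize each record into an object (readFields),
    // then re-serialize it on append (write) -- extra work and extra
    // garbage for every record.
    static byte[] spillViaObjects(byte[][] records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (byte[] rec : records) {
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(rec));
            int value = in.readInt();  // stands in for readFields(keyIn)
            out.writeInt(value);       // stands in for write() in append()
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Patched path: the buffer already holds serialized bytes, so copy
    // them through untouched -- no objects materialized.
    static byte[] spillRaw(byte[][] records) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] rec : records) {
            out.write(rec);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[][] recs = { {0, 0, 0, 7}, {0, 0, 0, 42} };
        // Both paths produce identical output; the raw path just skips
        // the per-record round-trip.
        System.out.println(
            Arrays.equals(spillViaObjects(recs), spillRaw(recs)));
    }
}
```

This is the same shape as appending raw key/value bytes through SequenceFile.Writer instead of going through Writable objects; the wrinkle noted above is that the value arrives as a ValueBytes whose concrete compressed/uncompressed type the writer has to know about.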
Thoughts?
Attachments
Issue Links
- is part of HADOOP-2919 Create fewer copies of buffer data during sort/spill (Closed)