[HADOOP-87] SequenceFile performance degrades substantially when compression is on and large values are encountered - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.1.0
Fix Version/s: 0.1.0
Component/s: io
Labels: None

Description

The code snippet in question is:

if (deflateValues) {
  deflateIn.reset();
  val.write(deflateIn);
  deflater.reset();
  deflater.setInput(deflateIn.getData(), 0, deflateIn.getLength());
  deflater.finish();
  while (!deflater.finished()) {
    int count = deflater.deflate(deflateOut);
    buffer.write(deflateOut, 0, count);
  }
} else {

A couple of issues with this code:

1. The value is serialized to the 'deflateIn' buffer, which is an instance of 'DataOutputBuffer'. This buffer grows as large as needed to store the serialized value and then stays as large as the largest serialized value encountered. If, for instance, a stream has a single 8MB value followed by several 8KB values, the size of the buffer stays at 8MB. The problem is that the entire 8MB buffer is always copied over the JNI boundary, regardless of the size of the value. We've observed this over several runs, where compression performance degrades by a couple of orders of magnitude once a very large value is encountered. Shrinking the buffer fixes the problem.

2. The data is copied many times. First, the value is serialized into 'deflateIn'. Second, the value is copied over the JNI boundary in every iteration of the while loop. Third, the compressed data is copied piecemeal into 'deflateOut'. Finally, it is appended to 'buffer'.
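The grow-only behaviour described in issue 1 can be illustrated with a small sketch. 'GrowOnlyBuffer' below is a hypothetical stand-in written for this example; it is not Hadoop's 'DataOutputBuffer', and it only assumes the behaviour stated above: reset() discards the contents but keeps the backing array.

```java
import java.util.Arrays;

public class GrowOnlyBufferDemo {
    /** Hypothetical grow-only buffer; an illustration, not Hadoop's DataOutputBuffer. */
    static class GrowOnlyBuffer {
        private byte[] data = new byte[32];
        private int length = 0;

        /** Forgets the contents but keeps the (possibly very large) backing array. */
        void reset() {
            length = 0;
        }

        /** Appends bytes, growing the backing array when it is too small. */
        void write(byte[] src, int off, int len) {
            if (length + len > data.length) {
                data = Arrays.copyOf(data, Math.max(data.length * 2, length + len));
            }
            System.arraycopy(src, off, data, length, len);
            length += len;
        }

        int getLength() { return length; }

        int capacity() { return data.length; }
    }

    public static void main(String[] args) {
        GrowOnlyBuffer buf = new GrowOnlyBuffer();
        buf.write(new byte[8 * 1024 * 1024], 0, 8 * 1024 * 1024); // one 8MB value
        buf.reset();
        buf.write(new byte[8 * 1024], 0, 8 * 1024);               // then an 8KB value
        // The valid length is 8KB, but the backing array is still 8MB.
        System.out.println("length=" + buf.getLength() + " capacity=" + buf.capacity());
    }
}
```

After the large value, every later value is held in an 8MB array even though only the first 8KB is meaningful, which is the situation the description identifies as the cause of the slowdown.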

Proposed fix:

1. Don't let big buffers persist. Allow 'deflateIn' to grow to a persistent maximum reasonable size, say 64KB. If a larger value is encountered, grow the buffer in order to process the value, then shrink it back to the maximum size. To do this, we add a 'reset' method which takes a buffer size.

2. Don't use a loop to deflate. The maximum size of the output can be determined by 'maxOutputSize = inputSize * 1.01 + 12'. This is the maximum output size that zlib will produce. We allocate a large enough output buffer and compress everything in 1 pass. The output buffer, of course, needs to shrink as well.
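The two proposed changes could be sketched roughly as follows. This is not the attached patch: the class 'SinglePassDeflater', its methods, and the 64KB constant name are hypothetical, chosen for this illustration. It uses the output bound quoted above and java.util.zip.Deflater, and it checks finished() after the single deflate() call rather than assuming the bound always holds.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class SinglePassDeflater {
    /** Persistent cap on the output buffer, the 64KB suggested in the description. */
    static final int MAX_PERSISTENT_SIZE = 64 * 1024;

    private final Deflater deflater = new Deflater();
    private byte[] out = new byte[MAX_PERSISTENT_SIZE];

    /** Worst-case compressed size, using the bound quoted in the description. */
    static int maxOutputSize(int inputSize) {
        return (int) Math.ceil(inputSize * 1.01) + 12;
    }

    /** Compresses input[0..length) with one deflate() call; returns the byte count. */
    int compress(byte[] input, int length, ByteArrayOutputStream sink) {
        int needed = maxOutputSize(length);
        if (out.length < needed) {
            out = new byte[needed];                 // grow for this value only
        }
        deflater.reset();
        deflater.setInput(input, 0, length);
        deflater.finish();
        int count = deflater.deflate(out, 0, out.length);
        if (!deflater.finished()) {
            // Guard in case the assumed bound is too small for some input.
            throw new IllegalStateException("output bound too small");
        }
        sink.write(out, 0, count);
        if (out.length > MAX_PERSISTENT_SIZE) {
            out = new byte[MAX_PERSISTENT_SIZE];    // shrink back to the cap
        }
        return count;
    }

    int outputCapacity() { return out.length; }

    /** Releases the native zlib resources held by the Deflater. */
    void end() { deflater.end(); }

    public static void main(String[] args) {
        SinglePassDeflater d = new SinglePassDeflater();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        int big = d.compress(new byte[8 * 1024 * 1024], 8 * 1024 * 1024, sink);
        int small = d.compress(new byte[8 * 1024], 8 * 1024, sink);
        System.out.println("compressed sizes: " + big + ", " + small
                + "; persistent capacity: " + d.outputCapacity());
        d.end();
    }
}
```

Reallocating after an oversized value trades an occasional allocation for a buffer that never stays larger than the cap. The same grow-then-shrink idea would apply to 'deflateIn' through the proposed 'reset' method that takes a buffer size.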

Attachments


- hadoop-87-3.txt, 21/Mar/06 08:29, 1 kB, Doug Cutting
- hadoop-87.fix, 18/Mar/06 07:30, 6 kB, Hairong Kuang
- hadoop_87.fix, 18/Mar/06 03:30, 7 kB, Hairong Kuang

People

Assignee: Doug Cutting

Reporter: Sameer Paranjpye

Votes: 0

Watchers: 0

Dates

Created: 17/Mar/06 06:49

Updated: 03/Aug/06 17:46

Resolved: 22/Mar/06 05:42