Description
The error message in current master:
java.lang.IllegalArgumentException
    at java.nio.Buffer.position(Buffer.java:244)
    at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
    at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
    at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
    at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515)
The same error can look slightly different in older versions:
java.io.IOException: Seek outside of data in compressed stream Stream for column 15 kind DATA position: 111956 length: 383146 range: 0 offset: 36674 limit: 36674 range 0 = 75282 to 36674; range 1 = 151666 to 40267; range 2 = 228805 to 41623 uncompressed: 1024 to 1024 to 111956
Here is the metadata extracted from the problematic ORC file:
Compression: ZLIB
Compression size: 1024
Calendar: Julian/Gregorian
Type: struct<col:timestamp>

Row group indices:
  Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
  Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max: 2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
  Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
  Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18
To understand this issue, we need to understand what each number in a row group index entry means. Each compressed stream needs three numbers to record a position: the position in the compressed stream, the number of bytes left in the uncompressed buffer, and the number of values left in the RLE writer. Take entry 3 as an example: 109907 is the position in the compressed stream after all the values for entry 2 have been processed, 934 uncompressed bytes for entry 2 still need to be consumed, and 398 values for entry 2 in the RLE writer still need to be consumed. There are six numbers because timestamp columns use two streams, one for seconds and the other for nanoseconds.
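The grouping described above can be sketched in Java. This is only an illustration of how the six numbers of a timestamp column split into two triples; the class `RowIndexPositions`, the nested `StreamPosition` type, and the field names are hypothetical and are not part of the ORC code base.

```java
public class RowIndexPositions {

    /** Hypothetical holder for the three numbers recorded per compressed stream. */
    public static final class StreamPosition {
        public final long compressedOffset;   // position in the compressed stream
        public final long uncompressedBytes;  // bytes left in the uncompressed buffer
        public final long rleValues;          // values left in the RLE writer

        StreamPosition(long compressedOffset, long uncompressedBytes, long rleValues) {
            this.compressedOffset = compressedOffset;
            this.uncompressedBytes = uncompressedBytes;
            this.rleValues = rleValues;
        }

        @Override
        public String toString() {
            return "compressed=" + compressedOffset
                + " uncompressed=" + uncompressedBytes
                + " rle=" + rleValues;
        }
    }

    /** Splits a flat positions array into one triple per compressed stream. */
    public static StreamPosition[] split(long[] positions) {
        if (positions.length % 3 != 0) {
            throw new IllegalArgumentException("expected a multiple of 3 numbers");
        }
        StreamPosition[] result = new StreamPosition[positions.length / 3];
        for (int i = 0; i < result.length; i++) {
            result[i] = new StreamPosition(
                positions[3 * i], positions[3 * i + 1], positions[3 * i + 2]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Entry 3 from the dump above: seconds stream first, then nanoseconds.
        long[] entry3 = {109907, 934, 398, 8581, 322, 18};
        for (StreamPosition p : split(entry3)) {
            System.out.println(p);
        }
    }
}
```

For entry 3 this yields one triple for the seconds stream (109907, 934, 398) and one for the nanoseconds stream (8581, 322, 18).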
The issue occurs when entry 2 is selected and read, because the end offset computed for this row group is incorrect. More specifically, when the compression size is smaller than 4096, there are edge cases in which a factor of 2 is not enough to accommodate all the blocks (see the code snippet below).
public static long estimateRgEndOffset(boolean isCompressed,
                                       int bufferSize,
                                       boolean isLast,
                                       long nextGroupOffset,
                                       long streamLength) {
  // figure out the worst case last location
  // if adjacent groups have the same compressed block offset then stretch the slop
  // by factor of 2 to safely accommodate the next compression block.
  // One for the current compression block and another for the next compression block.
  long slop = isCompressed
      ? 2 * (OutStream.HEADER_SIZE + bufferSize)
      : WORST_UNCOMPRESSED_SLOP;
  return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
}
In our case, we need slop > 934 (buffer) + 398 * 4 + header bytes, but the current implementation gives slop = 1027 * 2 = 2054, which causes a seek outside of the range. Here each value needs only 4 bytes, but a value can take 8 bytes in the worst case.
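The arithmetic can be checked with a small Java sketch. The class `SlopCheck` and its methods are hypothetical; `HEADER_SIZE = 3` is inferred from the 1027 figure above (1024 + 3), and the header bytes of the extra blocks are left out of the requirement, so the real shortfall is slightly larger than shown.

```java
public class SlopCheck {
    // Assumed value of OutStream.HEADER_SIZE, inferred from 1027 = 1024 + 3.
    static final int HEADER_SIZE = 3;

    /** Slop computed by the current implementation for a compressed stream. */
    public static long currentSlop(int bufferSize) {
        return 2L * (HEADER_SIZE + bufferSize);
    }

    /** Bytes still to be read past the recorded offset, ignoring block headers. */
    public static long bytesNeeded(long uncompressedBytesLeft,
                                   long rleValuesLeft,
                                   int bytesPerValue) {
        return uncompressedBytesLeft + rleValuesLeft * bytesPerValue;
    }

    public static void main(String[] args) {
        long slop = currentSlop(1024);           // 2 * 1027 = 2054
        long needed = bytesNeeded(934, 398, 4);  // 934 + 1592 = 2526
        System.out.println("slop   = " + slop);
        System.out.println("needed = " + needed);
        System.out.println("enough = " + (slop > needed));
    }
}
```

With the numbers from entry 3, the slop is 2054 while at least 2526 bytes are needed, so the estimated end offset falls short.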
For the worst case, we may also have an uncompressed block in a compressed stream. Suppose the compression size is C; then the factor = 1 (buffer) + (511 * 8 + header bytes) / C, rounded up.
C = 1024 -> factor should be 5
C = 512 -> factor should be 9 ... and so forth.
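The factor for other compression sizes can be tabulated with a short Java sketch. It assumes 3 header bytes and rounds the division up; the class `SlopFactor` and the constant names are hypothetical, and this is not the fix that was applied to ORC.

```java
public class SlopFactor {
    static final int HEADER_SIZE = 3;          // assumed header bytes
    static final int MAX_RLE_VALUES = 511;     // values left in the RLE writer, from the formula above
    static final int MAX_BYTES_PER_VALUE = 8;  // worst-case bytes per value

    /** factor = 1 (buffer) + ceil((511 * 8 + header bytes) / C) */
    public static int requiredFactor(int compressionSize) {
        int worstRun = MAX_RLE_VALUES * MAX_BYTES_PER_VALUE + HEADER_SIZE;
        // Integer ceiling division.
        return 1 + (worstRun + compressionSize - 1) / compressionSize;
    }

    public static void main(String[] args) {
        for (int c : new int[] {4096, 2048, 1024, 512}) {
            System.out.println("C = " + c + " -> factor " + requiredFactor(c));
        }
    }
}
```

Under these assumptions the sketch gives a factor of 2 for C = 4096, 3 for C = 2048, 5 for C = 1024, and 9 for C = 512, which is consistent with the existing factor of 2 being sufficient only at 4096 and above.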