Description
Reading the attached ORC file with the SearchArgument "sr_return_amt > 10000" using the C++ reader fails with:
Corrupt PATCHED_BASE encoded data (pl==0)!
Reading the file without the SearchArgument works, and the Java reader can read it with the same SearchArgument.
The source code for reproducing the issue (scan_with_sarg.cc) is attached. Build the ORC library, then compile the reproducer with:
g++ scan_with_sarg.cc -o scan_with_sarg -I../c++/include -Ic++/include -Lc++/src/ -Lsnappy_ep-prefix/src/snappy_ep-build/ -Llz4_ep-prefix/src/lz4_ep-build/ -Lzlib_ep-prefix/src/zlib_ep-build/ -Lzstd_ep-prefix/src/zstd_ep-build/lib/ -Lprotobuf_ep-prefix/src/protobuf_ep-build/ -lorc -lz -lsnappy -llz4 -lzstd -lprotobuf
Run it as
$ LD_LIBRARY_PATH="$LD_LIBRARY_PATH:zstd_ep-prefix/src/zstd_ep-build/lib/" ./scan_with_sarg
leaf-0 = (column(id=17) <= 10000), expr = (not leaf-0)
terminate called after throwing an instance of 'orc::ParseError'
  what():  Corrupt PATCHED_BASE encoded data (pl==0)!
Aborted (core dumped)
RCA
The sarg introduces a seek to RowGroup 42. The following code in DecompressionStream::seek doesn't handle the case where uncompressedBufferLength < posInChunk. As a result, it seeks to an illegal position and the length computation overflows.
if (headerPosition == seekedPosition
    && inputBufferStartPosition <= headerPosition + 3 && inputBufferStart) {
  position.next();  // Skip the input level position.
  size_t posInChunk = position.next();  // Chunk level position.
  // Overflow here! uncompressedBufferLength=30950, posInChunk=39498
  outputBufferLength = uncompressedBufferLength - posInChunk;
  outputBuffer = outputBufferStart + posInChunk;
  return;
}
That chunk is an uncompressed chunk, and such a chunk is read in pieces rather than as a whole. The data at the target position (posInChunk) hasn't been read into the buffer yet, so we need to handle this case.
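To make the overflow concrete, here is a minimal sketch of the same subtraction in isolation. The helper name remainingUnchecked is hypothetical and not part of the ORC code base; it only mirrors the expression uncompressedBufferLength - posInChunk. Because size_t is unsigned, the result wraps around to a huge value instead of going negative.

```cpp
#include <cstddef>

// Hypothetical helper (not an ORC function) that mirrors the unchecked
// subtraction in DecompressionStream::seek. With unsigned operands the
// result wraps around modulo 2^N when posInChunk is the larger value.
inline std::size_t remainingUnchecked(std::size_t uncompressedBufferLength,
                                      std::size_t posInChunk) {
  return uncompressedBufferLength - posInChunk;
}
```

With the values from this report, remainingUnchecked(30950, 39498) yields a length far larger than the buffer itself, which would explain why the decoder later reads garbage and reports corrupt PATCHED_BASE data.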
I think this only happens with uncompressed chunks. Compressed chunks are decompressed as a whole, so posInChunk is always a valid offset in the output buffer.
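One possible shape for the guard is sketched below. This is an illustration under assumptions, not the actual fix: BufferState and trySeekWithinBuffer are hypothetical names, and the struct is a simplified stand-in for the stream's members quoted above. The idea is to take the fast path only when posInChunk lies within the bytes that have already been buffered, and otherwise report failure so the caller can fall back to the regular seek path.

```cpp
#include <cstddef>

// Simplified, hypothetical model of the buffered state. The field names
// follow the snippet in this report, but this is not the real ORC class.
struct BufferState {
  const char* outputBufferStart;         // start of the bytes buffered so far
  std::size_t uncompressedBufferLength;  // number of bytes buffered so far
  const char* outputBuffer;              // current read cursor
  std::size_t outputBufferLength;        // bytes remaining after the cursor
};

// Returns true if the seek could be served from the buffered bytes.
// Returns false, leaving the state untouched, when posInChunk points past
// what has been buffered (e.g. an uncompressed chunk read in pieces);
// the caller is then expected to fall back to a full seek.
inline bool trySeekWithinBuffer(BufferState& s, std::size_t posInChunk) {
  if (posInChunk > s.uncompressedBufferLength) {
    return false;  // would wrap around in the subtraction below
  }
  s.outputBufferLength = s.uncompressedBufferLength - posInChunk;
  s.outputBuffer = s.outputBufferStart + posInChunk;
  return true;
}
```

Allowing posInChunk == uncompressedBufferLength (zero bytes remaining) is a design choice in this sketch. Whether the real stream should treat that boundary as a hit or as a miss depends on how the subsequent read refills the buffer.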
Attachments
Issue Links
- is caused by ORC-614 Implement efficient seek() in decompression streams (Closed)
- links to