Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.2.1
-
Our repro is on ubuntu xenial LTS, with hadoop 3.2.1 linking to libzstd 1.3.1. The bug is difficult to reproduce in an end-to-end environment (eg running an actual hadoop job with zstd compression) because it's very sensitive to the exact input and output characteristics. I reproduced the bug by turning one of the existing unit tests into a crude fuzzer, but I'm not sure upstream will accept that patch, so I've attached it separately on this ticket.
Note that the existing unit test for testCompressingWithOneByteOutputBuffer fails to reproduce this bug. This is because it's using the license file as input, and this file is too small. libzstd has internal buffering (in our environment it seems to be 128 kilobytes), and the license file is only 10 kilobytes. Thus libzstd is able to consume all the input and compress it in a single call, then return pieces of its internal buffer one byte at a time. Since all the input is consumed in a single call, uncompressedDirectBufOff and uncompressedDirectBufLen are both set to zero and thus the bug does not reproduce.
Our repro is on ubuntu xenial LTS, with hadoop 3.2.1 linking to libzstd 1.3.1. The bug is difficult to reproduce in an end-to-end environment (eg running an actual hadoop job with zstd compression) because it's very sensitive to the exact input and output characteristics. I reproduced the bug by turning one of the existing unit tests into a crude fuzzer, but I'm not sure upstream will accept that patch, so I've attached it separately on this ticket. Note that the existing unit test for testCompressingWithOneByteOutputBuffer fails to reproduce this bug. This is because it's using the license file as input, and this file is too small. libzstd has internal buffering (in our environment it seems to be 128 kilobytes), and the license file is only 10 kilobytes. Thus libzstd is able to consume all the input and compress it in a single call, then return pieces of its internal buffer one byte at a time. Since all the input is consumed in a single call, uncompressedDirectBufOff and uncompressedDirectBufLen are both set to zero and thus the bug does not reproduce.
Description
A bug in index handling causes ZStandardCompressor.c to pass a malformed ZSTD_inBuffer to libzstd. libzstd then returns an "Error (generic)" that gets thrown. The crux of the issue is two variables, uncompressedDirectBufLen and uncompressedDirectBufOff. The hadoop code counts uncompressedDirectBufOff from the start of uncompressedDirectBuf, then uncompressedDirectBufLen is counted from uncompressedDirectBufOff. However, libzstd considers pos and size to both be counted from the start of the buffer. As a result, this line https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228 causes a malformed buffer to be passed to libzstd, where pos>size. Here's a longer description of the bug in case this abstract explanation is unclear:
Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData) with five bytes of input. This results in uncompressedDirectBufOff=0 and uncompressedDirectBufLen=5 (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).
Then we call compress(), which initializes a ZSTD_inBuffer (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196). The definition of those libzstd structs is here https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261 - note that we set size=uncompressedDirectBufLen and pos=uncompressedDirectBufOff. The ZSTD_inBuffer gets passed to libzstd, compression happens, etc. When libzstd returns from the compression function, it updates the ZSTD_inBuffer struct to indicate how many bytes were consumed (https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920). Note that pos is advanced, but size is unchanged.
Now, libzstd does not guarantee that the entire input will be compressed in a single call of the compression function. (Some of the compression libraries used by hadoop, such as snappy, do provide this guarantee, but libzstd is not one of them.) So the hadoop native code updates uncompressedDirectBufOff and uncompressedDirectBufLen using the updated ZSTD_inBuffer: https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228
Now, returning to our example, we started with 5 bytes of uncompressed input. Suppose libzstd compressed 4 of those bytes, leaving one unread. This would result in a ZSTD_inBuffer struct with size=5 (unchanged) and pos=4 (four bytes were consumed). The hadoop native code would then set uncompressedDirectBufOff=4, but it would also set uncompressedDirectBufLen=1 (five minus four equals one).
Since some of the input was not consumed, we will eventually call compress() again. Then we instantiate another ZSTD_inBuffer struct, this time with size=1 and pos=4. This is a bug - libzstd expects size and pos to both be counted from the start of the buffer, therefore pos>size is unsound. So it returns an error https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932 which gets escalated as a java.lang.InternalError.
I will be submitting a pull request on github with a fix for this bug. The key is that the hadoop code should handle offsets the same way libzstd does, ie uncompressedDirectBufLen should be counted from the start of uncompressedDirectBuf, not from uncompressedDirectBufOff.
Attachments
Attachments
Issue Links
- is caused by
-
HADOOP-13578 Add Codec for ZStandard Compression
- Resolved
- is duplicated by
-
HADOOP-17219 ZStandardCodec compression mail fail(generic error) when encounter specific file
- Resolved
- is related to
-
HDFS-14099 Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor
- Resolved
- links to