Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.4.0, 2.4.7
Fix Version/s: None
Description
Background
I found a rare case in which some partitions read less data than expected when zstd is used.
Detail
I saved both the normal shuffle data and the corrupted shuffle data, and found that the corrupted data was a prefix of the normal data. I also found that zstd-jni at tag 1.3.3-2 has a problem where it can read only a prefix of the whole data and then exit normally.
The ZstdInputStream in zstd-jni (tag 1.3.3-2) may return 0 from a read(byte[], int, int) call. This violates the InputStream contract, which only allows 0 to be returned when the requested len is 0. Spark wraps ZstdInputStream in a BufferedInputStream; when the underlying read call returns 0, BufferedInputStream treats it as end of stream and stops reading, which can lead to data loss.
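To make the failure mode concrete, here is a small self-contained sketch (not Spark or zstd-jni code): ZeroReturningInputStream is a hypothetical stream that sometimes returns 0 from read(byte[], int, int) even though more data remains, mimicking the old ZstdInputStream behaviour. Wrapped in a BufferedInputStream, the first 0 return is treated as end of stream and the remaining bytes are silently dropped.
{code:java}
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stream that mimics the zstd-jni 1.3.3-2 behaviour:
// it sometimes returns 0 from read(byte[], int, int) even though more
// data is available, which violates the InputStream contract.
class ZeroReturningInputStream extends InputStream {
    private final InputStream in;
    private boolean returnDataNext = false;

    ZeroReturningInputStream(byte[] data) {
        this.in = new ByteArrayInputStream(data);
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        returnDataNext = !returnDataNext;
        if (returnDataNext) {
            // Hand out a small prefix of the remaining data.
            return in.read(b, off, Math.min(len, 4));
        }
        return 0; // contract violation: len > 0 but 0 is returned
    }
}

public class PrematureEofDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[32];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;

        try (InputStream in =
                 new BufferedInputStream(new ZeroReturningInputStream(data))) {
            int total = 0;
            byte[] buf = new byte[8];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            // Prints fewer than 32 bytes: BufferedInputStream saw the 0 return
            // from the wrapped stream and treated it as end of stream.
            System.out.println("Read " + total + " of " + data.length + " bytes");
        }
    }
}
{code}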
zstd-jni issue:
https://github.com/luben/zstd-jni/issues/159
zstd-jni commit:
https://github.com/luben/zstd-jni/commit/7eec5581b8ccb0d98350ad5ba422337eebbbe70e
zstd-jni fixed this problem in tag 1.4.4-3; the shape of the fix is sketched below (see the commit linked above for the actual change).
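This is only an illustrative sketch of the pattern, not the zstd-jni source: read() keeps stepping the decoder until it has produced at least one byte or the stream has ended, so it can no longer return 0 for a non-zero len. The decompressSome() and frameFinished() hooks are hypothetical stand-ins for the native calls.
{code:java}
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch only (not the actual zstd-jni code): a decompressing
// stream whose read() loops until it has produced at least one byte or the
// stream has ended, so it never returns 0 for a non-zero len.
public abstract class ContractCompliantDecompressingStream extends InputStream {

    // Hypothetical hook: run one decompression step and return the number of
    // bytes produced (possibly 0 even though the stream is not finished yet).
    protected abstract int decompressSome(byte[] dst, int off, int len) throws IOException;

    // Hypothetical hook: true once the underlying frame is fully consumed.
    protected abstract boolean frameFinished();

    @Override
    public int read(byte[] dst, int off, int len) throws IOException {
        if (len == 0) {
            return 0; // the only case where 0 may be returned
        }
        int produced = 0;
        // Keep stepping the decoder until we have at least one byte or EOF.
        while (produced == 0 && !frameFinished()) {
            produced = decompressSome(dst, off, len);
        }
        return produced == 0 ? -1 : produced;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : (one[0] & 0xff);
    }
}
{code}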
So I think it is necessary to upgrade the zstd-jni version to 1.4.4-3 in Spark 2.4, since Spark 2.4 is widely used in production.
The relevant BufferedInputStream code is as follows:
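The snippet below is a simplified paraphrase of OpenJDK's java.io.BufferedInputStream (not the verbatim JDK source; mark handling and bounds checks are elided). The key point is that fill() only advances count when the wrapped stream returns more than 0 bytes, so a 0 return leaves pos >= count and the subsequent check reports end of stream.
{code:java}
// Simplified paraphrase of java.io.BufferedInputStream internals.
private void fill() throws IOException {
    byte[] buffer = getBufIfOpen();
    // ... mark handling elided ...
    count = pos;
    int n = getInIfOpen().read(buffer, pos, buffer.length - pos);
    if (n > 0)                // a return of 0 is silently ignored here
        count = n + pos;
}

public synchronized int read() throws IOException {
    if (pos >= count) {
        fill();
        if (pos >= count)     // still empty after fill(): treated as end of stream
            return -1;
    }
    return getBufIfOpen()[pos++] & 0xff;
}
{code}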
Attachments
Issue Links
- duplicates SPARK-30228 Update zstd-jni to 1.4.4-3 (Resolved)
- links to