Description
When reading some Parquet files, you can get an error like:
java.io.IOException: expecting more rows but reached last block. Read 0 out of 1194236
This happens when looking for a needle that's pretty rare in a large haystack.
The issue here, I believe, is that the total row count is calculated by summing the row counts of the blocks passed in (the loop shown in the fix below).
But the blocks we pass to the ParquetFileReader constructor are the ones we already filtered via
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups.
However, the ParquetFileReader constructor will filter the list of blocks again internally. If a block is filtered out by the latter pass but not the former, the vectorized reader will believe it should see more rows than it actually will.
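To make the shape of the problem concrete, here is a minimal sketch. The Parquet classes and calls are real parquet-mr APIs, but the method and variable names around them are illustrative, and whether the constructor's second pass actually drops extra blocks depends on the configured filter levels:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.compat.RowGroupFilter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

class RowCountMismatchSketch {
  static void demonstrate(Configuration conf, Path file, ParquetMetadata footer,
      FilterCompat.Filter filter, List<ColumnDescriptor> columns) throws IOException {
    MessageType schema = footer.getFileMetaData().getSchema();

    // First pass: row-group filtering done by the caller.
    List<BlockMetaData> blocks =
        RowGroupFilter.filterRowGroups(filter, footer.getBlocks(), schema);

    // This is how the vectorized reader computes its expected total today:
    // summing over the blocks that survived the FIRST pass.
    long totalRowCount = 0;
    for (BlockMetaData block : blocks) {
      totalRowCount += block.getRowCount();
    }

    // Second pass: the constructor may filter `blocks` again internally,
    // so the reader can end up with fewer row groups than we summed over.
    ParquetFileReader reader =
        new ParquetFileReader(conf, footer.getFileMetaData(), file, blocks, columns);

    // When the second pass drops a block the first pass kept, these differ,
    // and the reader hits "expecting more rows but reached last block".
    System.out.println("summed: " + totalRowCount
        + ", reader: " + reader.getRecordCount());
    reader.close();
  }
}
```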
The fix I used locally is pretty straightforward:
for (BlockMetaData block : blocks) { this.totalRowCount += block.getRowCount(); }
becomes
this.totalRowCount = this.reader.getRecordCount();
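This works because getRecordCount() totals rows over the reader's own block list, i.e. whatever survived the constructor's second filtering pass. A quick sanity check, assuming an already constructed reader (getRowGroups() and getRecordCount() are existing ParquetFileReader accessors):

```java
// getRecordCount() sums the rows of the blocks the reader actually kept
// after its internal filtering, so the total can't drift from what the
// reader will really return:
long expected = 0;
for (BlockMetaData block : reader.getRowGroups()) {
  expected += block.getRowCount();
}
assert expected == reader.getRecordCount();
```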
rdblue, do you know if this sounds right? Might the second filter pass in the ParquetFileReader filter out more blocks, leading to the count being off?