Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
1.9.0
None
None
Apache Hadoop 2.7.0
Apache Pig 0.17.0
Apache Parquet 1.9.0
Description
Hi,
I am running some experiments with Apache Parquet to test predicate pushdown and the effect of different row group sizes. My assumptions are:
1) The Parquet reader first reads the metadata to filter out row groups and data pages.
2) It then reads only those row groups and data pages that match the filter.
3) The total number of bytes read should be the size of the matching row group(s) plus the size of the metadata.
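The model behind these assumptions can be written down as a small Python sketch. It is only an illustration of the expectation, not of how the Parquet reader actually works. The function name `expected_bytes_read` is made up for this example, and treating the 252MB metadata size and the 16MB row group size as binary megabytes (MiB) is an assumption.

```python
MIB = 1024 * 1024


def expected_bytes_read(metadata_bytes, row_group_bytes, matching_row_groups=1):
    """Bytes read under the assumptions above: the file metadata plus every
    row group that survives the filter."""
    return metadata_bytes + matching_row_groups * row_group_bytes


# Values taken from this report; the MiB interpretation is an assumption.
row_group_bytes = 16 * MIB   # 16777216, the row group size used in this test
metadata_bytes = 252 * MIB   # reported metadata size of 252MB

expected = expected_bytes_read(metadata_bytes, row_group_bytes)
print(expected)              # 281018368 bytes, i.e. 268 MiB
```

Under this model, a lookup of a single record on the sorted column should cost roughly 268MB of reads, regardless of how many files the data is split into.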
I have a wide table with 1184 columns. Two columns are of type long and the remaining columns are binary. One of the long columns is sorted and unique. I disabled dictionary encoding and compression. The source file is 34GB in CSV. I converted it to Parquet and tried two options:
1) Generate a single Parquet file (i.e., 43GB).
2) Generate multiple Parquet files (i.e., 43GB in total).
I allow only one mapper to eliminate the effect of parallelism.
I have a query that searches for one record by the sorted column. The results below are for a row group size of 16MB and a data page size of 1MB.
When there is only one Parquet file:
Input(s):
Successfully read 1 records (22135659519 bytes) from: "/output/wide/16777216/1048576"
When there are multiple Parquet files:
Input(s):
Successfully read 1 records (800413428 bytes) from: "/output/wide/16777216/1048576"
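To put the two readings side by side, here is a short Python sketch that uses only the byte counts from the Pig output above and the 16MB row group size. The variable names are chosen for this example only.

```python
ROW_GROUP_BYTES = 16777216        # 16MB row group size, as in the output path

single_file_bytes = 22135659519   # bytes read with one Parquet file
multi_file_bytes = 800413428      # bytes read with multiple Parquet files

# How much more is read from the single file than from the multiple files.
ratio = single_file_bytes / multi_file_bytes

# The same readings expressed as a number of 16MB row groups.
single_in_row_groups = single_file_bytes / ROW_GROUP_BYTES
multi_in_row_groups = multi_file_bytes / ROW_GROUP_BYTES

print(f"single/multi ratio: {ratio:.1f}x")
print(f"single file: about {single_in_row_groups:.0f} row groups")
print(f"multiple files: about {multi_in_row_groups:.0f} row groups")
```

The single-file run reads roughly 27.7 times as many bytes as the multi-file run, which corresponds to about 1319 row groups' worth of data versus about 48.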
My questions are:
1) Why is there such a big difference? With one file it reads 22GB, and with multiple files it reads 800MB. Is this a bug?
2) Why does it not read 16MB plus the size of the metadata (which is 252MB)? Why does it read more than that?
3) Can I rely on the Pig statistics for estimating bytes read?
4) Are my assumptions correct, or am I missing something?
Could you please look into this problem and tell me whether it is a bug?
Logs are attached to this email.
Thank you
Regards
Rana Faisal