Details
- Sub-task
- Status: Resolved
- Major
- Resolution: Fixed
- None
Description
This sub-task leverages the summary stats in the Parquet metadata cache, namely totalRowCount (across all files and row groups) and the per-column totalNullCount (across all files and row groups), to answer plain COUNT aggregation queries that have no GROUP BY. Such queries are currently converted to a DirectScan by ConvertCountToDirectScanRule, which uses the row group metadata. However, this rule is applied on Drill logical rels and converts the logical plan to a physical plan with a DirectScanPrel. That is too late: the DrillScanRel created during logical planning has already read the entire metadata cache file, including its full list of row group entries. The metadata cache file can grow quite large, so this approach does not scale.
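To make the arithmetic concrete, here is a minimal sketch of how the two summary stats could answer COUNT queries. The class and method names (`MetadataSummary`, `countStar`, `countColumn`) are hypothetical and are not Drill's API; the sketch also assumes that a column with no recorded null count cannot be answered from the summary and must fall back to a regular scan.

```java
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical holder for the summary stats; not Drill's actual class.
final class MetadataSummary {
    private final long totalRowCount;                 // rows across all files and row groups
    private final Map<String, Long> totalNullCount;   // per-column nulls across all files and row groups

    MetadataSummary(long totalRowCount, Map<String, Long> totalNullCount) {
        this.totalRowCount = totalRowCount;
        this.totalNullCount = totalNullCount;
    }

    // COUNT(*) counts every row, so it equals totalRowCount.
    long countStar() {
        return totalRowCount;
    }

    // COUNT(col) counts non-null values: totalRowCount minus the column's null count.
    // Assumption: an empty result means the stat is missing and the caller must scan.
    OptionalLong countColumn(String column) {
        Long nulls = totalNullCount.get(column);
        if (nulls == null) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(totalRowCount - nulls);
    }
}
```

For example, with a totalRowCount of 1000 and a totalNullCount of 25 for column `a`, COUNT(*) would be 1000 and COUNT(a) would be 975, with no row group entries consulted.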
The solution is to use the metadata summary file created in DRILL-7063 and to add a new rule that is applied early, so that it operates on the Calcite logical rels instead of the Drill logical rels and prevents eager expansion of the list of files and row groups.
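The benefit of applying the rule before expansion can be illustrated with a small sketch in plain Java. It does not use Calcite or Drill classes; `LazyTableMetadata` and its loader are hypothetical names, and the model simply assumes the summary is available separately from the full row group list.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model: the summary is cheap to read, the row group list is expensive.
final class LazyTableMetadata {
    private final long summaryRowCount;                    // read from the summary file
    private final Supplier<List<String>> rowGroupLoader;   // reads the full metadata cache
    private List<String> rowGroups;                        // stays null until expansion is requested

    LazyTableMetadata(long summaryRowCount, Supplier<List<String>> rowGroupLoader) {
        this.summaryRowCount = summaryRowCount;
        this.rowGroupLoader = rowGroupLoader;
    }

    // An early rule can answer COUNT(*) from here without touching row groups.
    long totalRowCount() {
        return summaryRowCount;
    }

    // A later planning stage that needs row groups pays the expansion cost once.
    List<String> rowGroups() {
        if (rowGroups == null) {
            rowGroups = rowGroupLoader.get();
        }
        return rowGroups;
    }

    boolean isExpanded() {
        return rowGroups != null;
    }
}
```

In this model, a planner step that only calls `totalRowCount()` leaves `isExpanded()` false, which mirrors the goal of answering the query before the file and row group list is materialized.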
We will not remove the existing rule. It will continue to operate as before, because after some transformations we may still want to apply the COUNT optimization.
Issue Links
- depends upon
  - DRILL-7063 Create separate summary file for schema, totalRowCount, totalNullCount (includes maintenance): Resolved
- is related to
  - DRILL-3846 Metadata Caching : A count(*) query took more time with the cache in place: Resolved
- links to