[DRILL-1691] ConvertCountToDirectScan rule should be applicable for 2 or more COUNT aggregates - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.6.0
Fix Version/s: 1.12.0
Component/s: Query Planning & Optimization
Labels:
None

Description

The ConvertCountToDirectScan rule currently applies only when the query has a single COUNT(*) or COUNT(column) aggregate and no group-by. The rule should be extended to handle multiple such aggregates: it relies on the underlying ParquetGroupScan to supply the correct column value count, and retrieving that count for several columns should work the same way. However, if even one of those columns lacks statistics, the rule should not be applied.
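To make the proposed condition concrete, here is a minimal sketch of the applicability check in plain Java. It is not Drill's actual rule code: the class `CountToDirectScanSketch`, the method `tryDirectCounts`, and the map of per-column value counts are hypothetical stand-ins for whatever the planner and ParquetGroupScan expose. A column name of `"*"` stands for COUNT(*).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class CountToDirectScanSketch {

    /**
     * Hypothetical helper, not a Drill API. Returns the counts to emit from a
     * direct scan, or Optional.empty() when the rule must not be applied.
     *
     * @param hasGroupBy        true if the aggregate has grouping keys
     * @param countColumns      one entry per COUNT call; "*" means COUNT(*)
     * @param rowCount          total row count reported by the scan
     * @param columnValueCounts per-column non-null counts taken from statistics;
     *                          a missing key means "no statistics for this column"
     */
    public static Optional<List<Long>> tryDirectCounts(
            boolean hasGroupBy,
            List<String> countColumns,
            long rowCount,
            Map<String, Long> columnValueCounts) {

        // The rule only covers aggregates without group-by.
        if (hasGroupBy || countColumns.isEmpty()) {
            return Optional.empty();
        }

        List<Long> counts = new ArrayList<>();
        for (String column : countColumns) {
            if ("*".equals(column)) {
                // COUNT(*) needs only the row count.
                counts.add(rowCount);
                continue;
            }
            Long valueCount = columnValueCounts.get(column);
            if (valueCount == null) {
                // Even one column without statistics disables the rule.
                return Optional.empty();
            }
            counts.add(valueCount);
        }
        return Optional.of(counts);
    }

    public static void main(String[] args) {
        // Illustrative statistics for a 25-row table; the values are assumed.
        Map<String, Long> stats = Map.of("n_regionkey", 25L, "n_nationkey", 25L);

        // Two COUNT(column) aggregates, both columns have statistics.
        System.out.println(tryDirectCounts(
                false, List.of("n_regionkey", "n_nationkey"), 25L, stats));

        // One column has no statistics, so the rule is skipped.
        System.out.println(tryDirectCounts(
                false, List.of("n_regionkey", "n_name"), 25L, stats));
    }
}
```

The all-or-nothing return value mirrors the last sentence of the description: the plan either reads every count from statistics or keeps the regular aggregate.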

Here's an example sequence:

First do a CTAS to ensure that statistics are present for the table (the original Parquet data may not have stats):

0: jdbc:drill:zk=local> create table nation3 as select * from cp.`tpch/nation.parquet`;
+------------+---------------------------+
|  Fragment  | Number of records written |
+------------+---------------------------+
| 0_0        | 25                        |
+------------+---------------------------+

The Explain below shows the count is retrieved directly from the Scan:

0: jdbc:drill:zk=local> explain plan for select count(n_regionkey) as x from nation3;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(x=[$0])
00-02        Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@5db6cb92])

The following query, which computes two COUNT aggregates, introduces a StreamAgg into the plan that is not needed:

0: jdbc:drill:zk=local> explain plan for select count(n_regionkey) as x, count(n_nationkey) as y from nation3;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(x=[$0], y=[$1])
00-02        StreamAgg(group=[{}], x=[COUNT($0)], y=[COUNT($1)])
00-03          Project(n_regionkey=[$1], n_nationkey=[$0])
00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/nation3]], selectionRoot=/tmp/nation3, numFiles=1, columns=[`n_regionkey`, `n_nationkey`]]])

Issue Links

Is contained by

DRILL-4735 Count(dir0) on parquet returns 0 result

Resolved

People

Assignee: Arina Ielchiieva

Reporter: Aman Sinha

Votes: 0

Watchers: 2

Dates

Created: 12/Nov/14 01:35

Updated: 15/Aug/17 15:08

Resolved: 15/Aug/17 15:08