Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.16.0
-
None
-
None
Description
Observed that the WITH clause is materializing the computation it did as many times as it is being referred in the main query. The purpose of CTE is defeated here.
For instance, Query-15 of TPC:
. . . . . . . . . . . . . . )> WITH revenue0(supplier_no , total_revenue) AS ( . . . . . . . . . . . . . . )> SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount)) . . . . . . . . . . . . . . )> FROM lineitem . . . . . . . . . . . . . . )> WHERE l_shipdate >= '1996-07-01' . . . . . . . . . . . . . . )> AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY) . . . . . . . . . . . . . . )> GROUP BY l_suppkey ) . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue . . . . . . . . . . semicolon> FROM supplier, revenue0 . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no . . . . . . . . . . semicolon> AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0) . . . . . . . . . . semicolon> ORDER BY s_suppkey; +-----------+--------------------+-------------------------------+-----------------+---------------+ | s_suppkey | s_name | s_address | s_phone | total_revenue | +-----------+--------------------+-------------------------------+-----------------+---------------+ | 493 | Supplier#000000493 | 7tdI3AtlDll57sj5K48WLX j5RDbc | 21-252-702-2543 | 1779637.1723 | +-----------+--------------------+-------------------------------+-----------------+---------------+ 1 row selected (9.093 seconds) apache drill (hive.tpch_text)>
I just performed the above on a small subset. You can see the behavior from the explain plan and query profile. But usually, CTE based queries are run on huge tables like in users case(each table running into few TB's!).
Explain plan for above query:
apache drill (hive.tpch_text)> explain plan for . . . . . . . . . . semicolon> WITH revenue0(supplier_no , total_revenue) AS ( . . . . . . . . . . . . . . )> SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount)) . . . . . . . . . . . . . . )> FROM lineitem . . . . . . . . . . . . . . )> WHERE l_shipdate >= '1996-07-01' . . . . . . . . . . . . . . )> AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY) . . . . . . . . . . . . . . )> GROUP BY l_suppkey ) . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue . . . . . . . . . . semicolon> FROM supplier, revenue0 . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no . . . . . . . . . . semicolon> AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0) . . . . . . . . . . semicolon> ORDER BY s_suppkey; +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+ | text | json | +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+ | 00-00 Screen 00-01 Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$4]) 00-02 SelectionVectorRemover 00-03 Sort(sort0=[$0], dir0=[ASC]) 00-04 Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$5]) 00-05 Project(s_suppkey=[$3], s_name=[$4], s_address=[$5], s_phone=[$6], l_suppkey=[$0], EXPR$1=[$1], EXPR$0=[$2]) 00-06 HashJoin(condition=[=($3, $0)], joinType=[inner], semi-join: =[false]) 00-08 HashJoin(condition=[=($1, $2)], joinType=[inner], semi-join: =[false]) 00-10 HashAgg(group=[{0}], EXPR$1=[SUM($1)]) 00-12 Project(l_suppkey=[$0], EXPR$1=[$1]) 00-14 HashToRandomExchange(dist0=[[$0]]) 01-01 UnorderedMuxExchange 03-01 Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) 03-02 HashAgg(group=[{0}], EXPR$1=[SUM($1)]) 03-03 Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))]) 03-04 SelectionVectorRemover 03-05 Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))]) 03-06 Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]]) 00-09 StreamAgg(group=[{}], EXPR$0=[MAX($0)]) 00-11 Project(EXPR$1=[$1]) 00-13 HashAgg(group=[{0}], EXPR$1=[SUM($1)]) 00-15 Project(l_suppkey=[$0], EXPR$1=[$1]) 00-16 HashToRandomExchange(dist0=[[$0]]) 02-01 UnorderedMuxExchange 04-01 Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) 04-02 HashAgg(group=[{0}], EXPR$1=[SUM($1)]) 04-03 Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))]) 04-04 SelectionVectorRemover 04-05 Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))]) 04-06 Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]]) 00-07 Scan(table=[[hive, tpch_text, supplier]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:supplier), columns=[`s_suppkey`, `s_name`, `s_address`, `s_phone`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/supplier], confProperties={}]])
Observe Step 03-06 and 04-06. One of them could be avoided.