Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7566

Performance of Common Table Expression query-15

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.16.0
    • None
    • None

    Description

      Observed that the WITH clause is materializing the computation it did as many times as it is being referred in the main query. The purpose of CTE is defeated here.

      For instance, Query-15 of TPC:

      . . . . . . . . . . . . . . )> WITH revenue0(supplier_no , total_revenue) AS (
      . . . . . . . . . . . . . . )>     SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
      . . . . . . . . . . . . . . )>     FROM lineitem
      . . . . . . . . . . . . . . )>     WHERE l_shipdate >= '1996-07-01'
      . . . . . . . . . . . . . . )>       AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
      . . . . . . . . . . . . . . )>     GROUP BY l_suppkey )
      . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
      . . . . . . . . . . semicolon> FROM supplier, revenue0
      . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no
      . . . . . . . . . . semicolon>   AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
      . . . . . . . . . . semicolon> ORDER BY s_suppkey;
      +-----------+--------------------+-------------------------------+-----------------+---------------+
      | s_suppkey |       s_name       |           s_address           |     s_phone     | total_revenue |
      +-----------+--------------------+-------------------------------+-----------------+---------------+
      | 493       | Supplier#000000493 | 7tdI3AtlDll57sj5K48WLX j5RDbc | 21-252-702-2543 | 1779637.1723  |
      +-----------+--------------------+-------------------------------+-----------------+---------------+
      1 row selected (9.093 seconds)
      apache drill (hive.tpch_text)>
      
      

      I just performed the above on a small subset. You can see the behavior from the explain plan and query profile. But usually, CTE based queries are run on huge tables like in users case(each table running into few TB's!).

      Explain plan for above query:

      
      
      apache drill (hive.tpch_text)> explain plan for 
      . . . . . . . . . . semicolon> WITH revenue0(supplier_no , total_revenue) AS (
      . . . . . . . . . . . . . . )>     SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
      . . . . . . . . . . . . . . )>     FROM lineitem
      . . . . . . . . . . . . . . )>     WHERE l_shipdate >= '1996-07-01'
      . . . . . . . . . . . . . . )>       AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
      . . . . . . . . . . . . . . )>     GROUP BY l_suppkey )
      . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
      . . . . . . . . . . semicolon> FROM supplier, revenue0
      . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no
      . . . . . . . . . . semicolon>   AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
      . . . . . . . . . . semicolon> ORDER BY s_suppkey;
      +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
      |                                       text                                       |                                       json                                       |
      +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
      | 00-00    Screen
      00-01      Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$4])
      00-02        SelectionVectorRemover
      00-03          Sort(sort0=[$0], dir0=[ASC])
      00-04            Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$5])
      00-05              Project(s_suppkey=[$3], s_name=[$4], s_address=[$5], s_phone=[$6], l_suppkey=[$0], EXPR$1=[$1], EXPR$0=[$2])
      00-06                HashJoin(condition=[=($3, $0)], joinType=[inner], semi-join: =[false])
      00-08                  HashJoin(condition=[=($1, $2)], joinType=[inner], semi-join: =[false])
      00-10                    HashAgg(group=[{0}], EXPR$1=[SUM($1)])
      00-12                      Project(l_suppkey=[$0], EXPR$1=[$1])
      00-14                        HashToRandomExchange(dist0=[[$0]])
      01-01                          UnorderedMuxExchange
      03-01                            Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
      03-02                              HashAgg(group=[{0}], EXPR$1=[SUM($1)])
      03-03                                Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
      03-04                                  SelectionVectorRemover
      03-05                                    Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))])
      03-06                                      Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]])
      00-09                    StreamAgg(group=[{}], EXPR$0=[MAX($0)])
      00-11                      Project(EXPR$1=[$1])
      00-13                        HashAgg(group=[{0}], EXPR$1=[SUM($1)])
      00-15                          Project(l_suppkey=[$0], EXPR$1=[$1])
      00-16                            HashToRandomExchange(dist0=[[$0]])
      02-01                              UnorderedMuxExchange
      04-01                                Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
      04-02                                  HashAgg(group=[{0}], EXPR$1=[SUM($1)])
      04-03                                    Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
      04-04                                      SelectionVectorRemover
      04-05                                        Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))])
      04-06                                          Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]])
      00-07                  Scan(table=[[hive, tpch_text, supplier]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:supplier), columns=[`s_suppkey`, `s_name`, `s_address`, `s_phone`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/supplier], confProperties={}]])
      

       

      Observe Step 03-06 and 04-06. One of them could be avoided.

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            adityaar Aditya Allamraju
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: