Details
Description
The plan for TPC-DS Q78 uses a bad join order with the largest input on the RHS, caused by a bad underestimate of the cardinality of the aggregation node in that join input.
Summary below (full profile including query and plan attached):
F15:ROOT 1 1 0.000ns 0.000ns 4.01 MB 4.00 MB 38:MERGING-EXCHANGE 1 1 0.000ns 0.000ns 100 100 2.81 MB 1.15 MB UNPARTITIONED F04:EXCHANGE SENDER 10 120 33.333us 3.999ms 1.31 KB 0 20:TOP-N 10 120 32.900ms 264.000ms 12.00K 100 52.00 KB 9.38 KB 19:HASH JOIN 10 120 1s694ms 2s044ms 312.51K 8.80M 55.12 KB 0 RIGHT OUTER JOIN, PARTITIONED |--F16:JOIN BUILD 10 120 3s735ms 5s052ms 1.20 GB 17.00 MB | 37:EXCHANGE 10 120 634.767ms 1s004ms 1.44B 8.80M 15.80 MB 25.00 MB HASH(d_year,ss_item_sk,ss_customer_sk) | F09:EXCHANGE SENDER 10 120 4s736ms 5s179ms 279.25 KB 0 | 18:HASH JOIN 10 120 979.068ms 1s140ms 1.44B 8.80M 43.12 KB 0 RIGHT OUTER JOIN, PARTITIONED | |--F17:JOIN BUILD 10 120 5s370ms 6s560ms 1.19 GB 8.50 MB | | 36:EXCHANGE 10 120 390.834ms 640.000ms 1.44B 8.80M 7.20 MB 17.50 MB HASH(d_year,ss_item_sk,ss_customer_sk) | | F14:EXCHANGE SENDER 10 120 3s330ms 4s148ms 279.25 KB 0 | | 35:AGGREGATE 10 120 6s362ms 9s508ms 1.44B 8.80M 960.04 MB 1.02 GB FINALIZE | | 34:EXCHANGE 10 120 517.901ms 788.005ms 1.46B 1.72B 14.31 MB 17.50 MB HASH(d_year,ss_item_sk,ss_customer_sk) | | F12:EXCHANGE SENDER 10 120 8s196ms 9s168ms 3.53 MB 0 | | 05:AGGREGATE 10 120 17s890ms 22s904ms 1.46B 1.72B 992.05 MB 780.95 MB STREAMING | | 04:HASH JOIN 10 120 512.300ms 672.001ms 1.48B 1.72B 37.12 KB 0 INNER JOIN, BROADCAST | | |--F18:JOIN BUILD 10 10 32.400ms 72.000ms 23.27 MB 23.25 MB | | | 33:EXCHANGE 10 10 0.000ns 0.000ns 365 373 16.00 KB 16.00 KB BROADCAST | | | F13:EXCHANGE SENDER 1 1 0.000ns 0.000ns 98.44 KB 0 | | | 02:SCAN S3 1 1 8.000ms 8.000ms 365 373 1.62 MB 16.00 MB tpcds_3000_decimal_parquet.date_dim | | 03:HASH JOIN 10 120 3s542ms 4s200ms 1.48B 8.64B 42.12 KB 0 LEFT OUTER JOIN, PARTITIONED | | |--F19:JOIN BUILD 10 120 1s178ms 1s432ms 384.03 MB 382.93 MB | | | 32:EXCHANGE 10 120 138.600ms 200.000ms 863.99M 863.99M 4.35 MB 12.34 MB HASH(sr_item_sk,sr_ticket_number) | | | F11:EXCHANGE SENDER 10 120 1s982ms 2s628ms 4.16 MB 0 | | | 01:SCAN S3 10 120 385.534ms 628.000ms 863.99M 863.99M 48.97 MB 40.00 MB tpcds_3000_decimal_parquet.store_returns | | 31:EXCHANGE 10 120 787.434ms 2s316ms 1.64B 8.64B 14.32 MB 15.62 MB HASH(ss_item_sk,ss_ticket_number) | | F10:EXCHANGE SENDER 10 120 7s394ms 11s732ms 3.61 MB 0 | | 00:SCAN S3 10 120 2s575ms 4s356ms 1.64B 8.64B 98.15 MB 88.00 MB tpcds_3000_decimal_parquet.store_sales | 30:AGGREGATE 10 120 8s492ms 15s428ms 386.62M 430.99M 324.04 MB 260.31 MB FINALIZE | 29:EXCHANGE 10 120 257.634ms 1s024ms 386.62M 430.99M 14.32 MB 17.50 MB HASH(d_year,ws_item_sk,ws_bill_customer_sk) | F07:EXCHANGE SENDER 10 120 4s837ms 10s996ms 3.53 MB 0 | 11:AGGREGATE 10 120 3s005ms 4s536ms 386.62M 430.99M 82.05 MB 195.24 MB STREAMING | 10:HASH JOIN 10 120 158.866ms 252.000ms 386.62M 430.99M 37.12 KB 0 INNER JOIN, BROADCAST | |--F20:JOIN BUILD 10 10 93.600ms 168.000ms 23.27 MB 23.25 MB | | 28:EXCHANGE 10 10 0.000ns 0.000ns 365 373 16.00 KB 16.00 KB BROADCAST | | F08:EXCHANGE SENDER 1 1 0.000ns 0.000ns 98.44 KB 0 | | 08:SCAN S3 1 1 10s172ms 10s172ms 365 373 1.62 MB 16.00 MB tpcds_3000_decimal_parquet.date_dim | 09:HASH JOIN 10 120 880.001ms 1s096ms 386.63M 2.16B 42.12 KB 0 LEFT OUTER JOIN, PARTITIONED | |--F21:JOIN BUILD 10 120 632.367ms 1s383ms 96.03 MB 95.73 MB | | 27:EXCHANGE 10 120 34.266ms 108.000ms 216.00M 216.00M 1.09 MB 12.34 MB HASH(wr_item_sk,wr_order_number) | | F06:EXCHANGE SENDER 10 120 1s331ms 2s823ms 4.16 MB 0 | | 07:SCAN S3 10 120 11s708ms 12s496ms 216.00M 216.00M 48.92 MB 24.00 MB tpcds_3000_decimal_parquet.web_returns | 26:EXCHANGE 10 120 348.834ms 2s176ms 429.58M 2.16B 14.38 MB 15.62 MB HASH(ws_item_sk,ws_order_number) | F05:EXCHANGE SENDER 10 120 2s616ms 6s356ms 3.61 MB 0 | 06:SCAN S3 10 120 12s429ms 16s108ms 429.58M 2.16B 47.53 MB 88.00 MB tpcds_3000_decimal_parquet.web_sales 25:AGGREGATE 10 120 10s950ms 16s904ms 767.74M 855.91M 648.04 MB 516.96 MB FINALIZE 24:EXCHANGE 10 120 445.234ms 1s184ms 769.46M 855.91M 14.32 MB 17.50 MB HASH(d_year,cs_item_sk,cs_bill_customer_sk) F02:EXCHANGE SENDER 10 120 5s297ms 8s900ms 3.53 MB 0 17:AGGREGATE 10 120 5s660ms 8s160ms 769.46M 855.91M 128.05 MB 387.72 MB STREAMING 16:HASH JOIN 10 120 319.367ms 460.000ms 769.49M 855.91M 37.12 KB 0 INNER JOIN, BROADCAST |--F22:JOIN BUILD 10 10 67.600ms 179.999ms 23.27 MB 23.25 MB | 23:EXCHANGE 10 10 0.000ns 0.000ns 365 373 16.00 KB 16.00 KB BROADCAST | F03:EXCHANGE SENDER 1 1 0.000ns 0.000ns 98.44 KB 0 | 14:SCAN S3 1 1 10s244ms 10s244ms 365 373 1.62 MB 16.00 MB tpcds_3000_decimal_parquet.date_dim 15:HASH JOIN 10 120 1s920ms 2s264ms 769.50M 4.32B 42.12 KB 0 LEFT OUTER JOIN, PARTITIONED |--F23:JOIN BUILD 10 120 1s448ms 1s832ms 192.03 MB 191.47 MB | 22:EXCHANGE 10 120 64.933ms 136.000ms 432.02M 432.02M 6.37 MB 12.34 MB HASH(cr_item_sk,cr_order_number) | F01:EXCHANGE SENDER 10 120 1s830ms 2s568ms 4.16 MB 0 | 13:SCAN S3 10 120 11s753ms 12s600ms 432.02M 432.02M 8.92 MB 32.00 MB tpcds_3000_decimal_parquet.catalog_returns 21:EXCHANGE 10 120 592.967ms 1s720ms 855.00M 4.32B 14.28 MB 15.62 MB HASH(cs_item_sk,cs_order_number) F00:EXCHANGE SENDER 10 120 4s149ms 5s608ms 3.61 MB 0 12:SCAN S3 10 120 13s398ms 18s564ms 855.00M 4.32B 57.61 MB 88.00 MB tpcds_3000_decimal_parquet.catalog_sales
The problematic aggregation is plan node 35 which has an estimated output of 8.8M rows and an actual output of 1.44B.
AggregateNode.java computes the cardinality by multiplying the NDV of the grouping expressions, then capping this value at the number of input rows and then finally applying the conjuncts. The conjunct in this case is the filter d_year=2001 which has selectivity 1/196=0.0051. The fact that we first cap the cardinality at the input size and then apply a very selective predicate to the already capped value results in a bad underestimate in this case.
It might be a better approach to apply the conjunct selectivity to the NDV product first and then cap the result of that based on the number of input rows, to reduce the risk of underestimation like we see in this case. Earlier versions of the code did did the operations in this order (apply conjuncts first, then cap based on input rows) but the order of these operations was reversed by this commit. It's not clear whether that commit intended to change the order of the conjunct application and the cap or if that was just an unintended side effect of code refactoring (maybe tarmstrong can clarify?).
I ran the same query with a straight_join hint to force the correct join order and got a 40% reduction in execution time. I also ran the query (unmodified) on an older version of the code prior to the commit that reversed the order of the conjunct application/cap and the resulting plan has a much better cardinality (see plan node 25) and correct join order (note this is a different cluster, larger scale factor, no mt_dop so don't directly compare the numbers to the more recent profile):
38:MERGING-EXCHANGE 1 771.965us 771.965us 100 100 456.00 KB 0 UNPARTITIONED 20:TOP-N 19 50.204ms 71.974ms 1.90K 100 52.00 KB 9.38 KB 19:HASH JOIN 19 1m4s 1m33s 1.26M 5.75B 11.66 GB 9.23 GB LEFT OUTER JOIN, PARTITIONED |--37:EXCHANGE 19 3s219ms 3s969ms 2.56B 2.85B 7.43 MB 0 HASH(d_year,cs_item_sk,cs_bill_customer_sk) | 36:AGGREGATE 19 1m22s 1m31s 2.56B 2.85B 11.66 GB 175.36 GB FINALIZE | 35:EXCHANGE 19 3s347ms 3s845ms 2.56B 2.85B 13.53 MB 0 HASH(d_year,cs_item_sk,cs_bill_customer_sk) | 17:AGGREGATE 19 46s016ms 53s791ms 2.56B 2.85B 992.06 MB 175.36 GB STREAMING | 16:HASH JOIN 19 3s512ms 3s836ms 2.56B 2.85B 1.99 MB 1.94 MB INNER JOIN, BROADCAST | |--34:EXCHANGE 19 14.477us 24.595us 365 373 18.97 KB 0 BROADCAST | | 14:SCAN HDFS 1 51.493ms 51.493ms 365 373 1.13 MB 32.00 MB tpcds_10000_decimal_parquet.date_dim | 15:HASH JOIN 19 53s103ms 1m8s 2.57B 14.40B 3.22 GB 1.24 GB LEFT OUTER JOIN, PARTITIONED | |--33:EXCHANGE 19 1s193ms 1s712ms 1.44B 1.44B 10.49 MB 0 HASH(cr_item_sk,cr_order_number) | | 13:SCAN HDFS 19 325.397ms 359.082ms 1.44B 1.44B 127.32 MB 144.00 MB tpcds_10000_decimal_parquet.catalog_returns | 32:EXCHANGE 19 3s019ms 3s927ms 2.85B 14.40B 13.22 MB 0 HASH(cs_item_sk,cs_order_number) | 12:SCAN HDFS 19 2s021ms 3s140ms 2.85B 14.40B 355.82 MB 528.00 MB tpcds_10000_decimal_parquet.catalog_sales 18:HASH JOIN 19 40s974ms 55s991ms 4.30B 5.75B 5.88 GB 4.65 GB LEFT OUTER JOIN, PARTITIONED |--31:EXCHANGE 19 1s731ms 1s864ms 1.29B 1.44B 7.16 MB 0 HASH(d_year,ws_item_sk,ws_bill_customer_sk) | 30:AGGREGATE 19 41s146ms 49s094ms 1.29B 1.44B 5.88 GB 88.31 GB FINALIZE | 29:EXCHANGE 19 1s743ms 1s989ms 1.29B 1.44B 13.53 MB 0 HASH(d_year,ws_item_sk,ws_bill_customer_sk) | 11:AGGREGATE 19 18s910ms 20s163ms 1.29B 1.44B 128.06 MB 88.31 GB STREAMING | 10:HASH JOIN 19 1s742ms 1s801ms 1.29B 1.44B 1.99 MB 1.94 MB INNER JOIN, BROADCAST | |--28:EXCHANGE 19 15.043us 24.887us 365 373 18.97 KB 0 BROADCAST | | 08:SCAN HDFS 1 53.015ms 53.015ms 365 373 1.13 MB 32.00 MB tpcds_10000_decimal_parquet.date_dim | 09:HASH JOIN 19 22s658ms 26s425ms 1.29B 7.20B 1.63 GB 636.07 MB LEFT OUTER JOIN, PARTITIONED | |--27:EXCHANGE 19 597.705ms 796.106ms 720.02M 720.02M 10.50 MB 0 HASH(wr_item_sk,wr_order_number) | | 07:SCAN HDFS 19 185.858ms 315.385ms 720.02M 720.02M 98.54 MB 80.00 MB tpcds_10000_decimal_parquet.web_returns | 26:EXCHANGE 19 1s530ms 2s739ms 1.43B 7.20B 13.23 MB 0 HASH(ws_item_sk,ws_order_number) | 06:SCAN HDFS 19 1s589ms 2s720ms 1.43B 7.20B 369.88 MB 528.00 MB tpcds_10000_decimal_parquet.web_sales 25:AGGREGATE 19 1m35s 1m51s 4.30B 5.75B 21.04 GB 353.23 GB FINALIZE 24:EXCHANGE 19 4s893ms 5s539ms 4.31B 5.75B 13.53 MB 0 HASH(d_year,ss_item_sk,ss_customer_sk) 05:AGGREGATE 19 2m40s 3m8s 4.31B 5.75B 20.91 GB 353.23 GB STREAMING 04:HASH JOIN 19 6s349ms 7s056ms 4.92B 5.75B 1.99 MB 1.94 MB INNER JOIN, BROADCAST |--23:EXCHANGE 19 15.995us 38.006us 365 373 18.97 KB 0 BROADCAST | 02:SCAN HDFS 1 49.584ms 49.584ms 365 373 1.13 MB 32.00 MB tpcds_10000_decimal_parquet.date_dim 03:HASH JOIN 19 1m46s 2m20s 4.92B 28.80B 6.41 GB 2.48 GB LEFT OUTER JOIN, PARTITIONED |--22:EXCHANGE 19 2s227ms 3s111ms 2.88B 2.88B 10.49 MB 0 HASH(sr_item_sk,sr_ticket_number) | 01:SCAN HDFS 19 611.931ms 686.818ms 2.88B 2.88B 205.33 MB 176.00 MB tpcds_10000_decimal_parquet.store_returns 21:EXCHANGE 19 5s953ms 7s325ms 5.47B 28.80B 13.40 MB 0 HASH(ss_item_sk,ss_ticket_number) 00:SCAN HDFS 19 2s520ms 3s222ms 5.47B 28.80B 492.66 MB 528.00 MB tpcds_10000_decimal_parquet.store_sales
Attachments
Attachments
Issue Links
- is related to
-
IMPALA-10136 Cardinality estimates for aggregation operations don't consider conjuncts on grouping expressions correctly
- Open