Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Cannot Reproduce
-
0.9.0
-
None
Description
I found that join orders affects abnormal to the result data as follows:
Environment
- DataSet: TPC-DS
- tajo.dist-query.join.broadcast.auto: false
Case: 1
SELECT COUNT(*) FROM (SELECT cs.cs_item_sk as cs_item_sk, cs.cs_ext_discount_amt as cs_ext_discount_amt FROM catalog_sales cs JOIN date_dim d ON (d.d_date_sk = cs.cs_sold_date_sk) WHERE d.d_date between '2000-01-27' and '2000-04-27') cs1 JOIN item i ON (i.i_item_sk = cs1.cs_item_sk) JOIN (SELECT cs2.cs_item_sk as cs_item_sk, 1.3 * avg(cs_ext_discount_amt) as avg_cs_ext_discount_amt FROM (SELECT cs.cs_item_sk as cs_item_sk, cs.cs_ext_discount_amt as cs_ext_discount_amt FROM catalog_sales cs JOIN date_dim d ON (d.d_date_sk = cs.cs_sold_date_sk) WHERE d.d_date between '2000-01-27' and '2000-04-27') cs2 GROUP BY cs2.cs_item_sk) tmp1 ON (i.i_item_sk = tmp1.cs_item_sk)
- actual result: 71147
- expected result: 4163848
Case 2
SELECT COUNT(*) FROM item i JOIN (SELECT cs.cs_item_sk as cs_item_sk, cs.cs_ext_discount_amt as cs_ext_discount_amt FROM catalog_sales cs JOIN date_dim d ON (d.d_date_sk = cs.cs_sold_date_sk) WHERE d.d_date between '2000-01-27' and '2000-04-27') cs1 ON (i.i_item_sk = cs1.cs_item_sk) JOIN (SELECT cs2.cs_item_sk as cs_item_sk, 1.3 * avg(cs_ext_discount_amt) as avg_cs_ext_discount_amt FROM (SELECT cs.cs_item_sk as cs_item_sk, cs.cs_ext_discount_amt as cs_ext_discount_amt FROM catalog_sales cs JOIN date_dim d ON (d.d_date_sk = cs.cs_sold_date_sk) WHERE d.d_date between '2000-01-27' and '2000-04-27') cs2 GROUP BY cs2.cs_item_sk) tmp1 ON (i.i_item_sk = tmp1.cs_item_sk);
- actual result: 23890
- expected result: 4163848
As you knew, two queries doesn't show the expected result. Furthermore, each result have differences. For reference, I made activated result using hive.
Attachments
Issue Links
- relates to
-
TAJO-748 Shuffle output numbers of join may be inconsistent.
- Resolved