Uploaded image for project: 'Tajo (Retired)'
  1. Tajo (Retired)
  2. TAJO-750

Join orders affects abnormal to the result data.

    XMLWordPrintableJSON

Details

    Description

      I found that join orders affects abnormal to the result data as follows:

      Environment

      • DataSet: TPC-DS
      • tajo.dist-query.join.broadcast.auto: false

      Case: 1

      SELECT COUNT(*)
      FROM (SELECT cs.cs_item_sk as cs_item_sk,
                                   cs.cs_ext_discount_amt as cs_ext_discount_amt
                   FROM catalog_sales cs
                   JOIN date_dim d ON (d.d_date_sk = cs.cs_sold_date_sk)
                   WHERE d.d_date between '2000-01-27' and '2000-04-27') cs1
      JOIN item i ON (i.i_item_sk = cs1.cs_item_sk)
      JOIN (SELECT cs2.cs_item_sk as cs_item_sk,
                                1.3 * avg(cs_ext_discount_amt) as avg_cs_ext_discount_amt
                 FROM (SELECT cs.cs_item_sk as cs_item_sk,
                                              cs.cs_ext_discount_amt as cs_ext_discount_amt
                              FROM catalog_sales cs
                              JOIN date_dim d ON (d.d_date_sk = cs.cs_sold_date_sk)
                              WHERE d.d_date between '2000-01-27' and '2000-04-27') cs2
                              GROUP BY cs2.cs_item_sk) tmp1
      ON (i.i_item_sk = tmp1.cs_item_sk)
      
      • actual result: 71147
      • expected result: 4163848

      Case 2

      SELECT COUNT(*)
      FROM item i
      JOIN (SELECT cs.cs_item_sk as cs_item_sk,
                                   cs.cs_ext_discount_amt as cs_ext_discount_amt
                   FROM catalog_sales cs
                   JOIN date_dim d ON (d.d_date_sk = cs.cs_sold_date_sk)
                   WHERE d.d_date between '2000-01-27' and '2000-04-27') cs1 ON (i.i_item_sk = cs1.cs_item_sk)
      JOIN (SELECT cs2.cs_item_sk as cs_item_sk,
                                1.3 * avg(cs_ext_discount_amt) as avg_cs_ext_discount_amt
                 FROM (SELECT cs.cs_item_sk as cs_item_sk,
                                              cs.cs_ext_discount_amt as cs_ext_discount_amt
                              FROM catalog_sales cs
                              JOIN date_dim d ON (d.d_date_sk = cs.cs_sold_date_sk)
                              WHERE d.d_date between '2000-01-27' and '2000-04-27') cs2
                              GROUP BY cs2.cs_item_sk) tmp1
      ON (i.i_item_sk = tmp1.cs_item_sk);
      
      • actual result: 23890
      • expected result: 4163848

      As you knew, two queries doesn't show the expected result. Furthermore, each result have differences. For reference, I made activated result using hive.

      Attachments

        Issue Links

          Activity

            People

              blrunner JaeHwa Jung
              blrunner JaeHwa Jung
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: