Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Duplicate
-
Impala 2.5.0
-
None
-
None
Description
to_date() is usually used to make timestamp into a date that can be grouped by in analytic functions, but it has a serious impact on the performance (duration and memory) when used
1. attempt around ~2sec, not using to_date()
[morhidi-570-2.gce.cloudera.com:21000] > select dat, emp_no, salary, rank() OVER (PARTITION BY dat order by salary desc) from sample limit 10; Query: select dat, emp_no, salary, rank() OVER (PARTITION BY dat order by salary desc) from sample limit 10 +---------------------+--------+--------+------------------+ | dat | emp_no | salary | rank() OVER(...) | +---------------------+--------+--------+------------------+ | 1985-01-14 00:00:00 | 110114 | 52070 | 1 | | 1985-02-01 00:00:00 | 87761 | 69338 | 1 | | 1985-02-01 00:00:00 | 297266 | 66536 | 2 | | 1985-02-01 00:00:00 | 273524 | 54860 | 3 | | 1985-02-01 00:00:00 | 278257 | 51142 | 4 | | 1985-02-01 00:00:00 | 229588 | 40463 | 5 | | 1985-02-01 00:00:00 | 235233 | 40000 | 6 | | 1985-02-01 00:00:00 | 200241 | 40000 | 6 | | 1985-02-01 00:00:00 | 234495 | 40000 | 6 | | 1985-02-03 00:00:00 | 453187 | 90217 | 1 | +---------------------+--------+--------+------------------+ Fetched 10 row(s) in 2.07s [morhidi-570-2.gce.cloudera.com:21000] > summary; +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------+ | 04:EXCHANGE | 1 | 0ns | 0ns | 10 | 10 | 0 B | -1 B | UNPARTITIONED | | 02:ANALYTIC | 3 | 1.67ms | 3.00ms | 20 | -1 | 8.06 MB | 0 B | | | 01:SORT | 3 | 861.03ms | 1.30s | 4.10K | -1 | 64.06 MB | 0 B | | | 03:EXCHANGE | 3 | 23.00ms | 35.00ms | 1.91M | -1 | 0 B | 0 B | HASH(dat) | | 00:SCAN HDFS | 3 | 240.01ms | 253.01ms | 2.84M | -1 | 33.86 MB | 64.00 MB | employees.sample | +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------+
2. attempt around ~1min using to_date()
[morhidi-570-2.gce.cloudera.com:21000] > select dat, emp_no, salary, rank() OVER (PARTITION BY to_date(dat) order by salary desc) from sample limit 10; Query: select dat, emp_no, salary, rank() OVER (PARTITION BY to_date(dat) order by salary desc) from sample limit 10 +---------------------+--------+--------+------------------+ | dat | emp_no | salary | rank() OVER(...) | +---------------------+--------+--------+------------------+ | 1985-02-02 00:00:00 | 96308 | 94821 | 1 | | 1985-02-02 00:00:00 | 204237 | 82183 | 2 | | 1985-02-02 00:00:00 | 461001 | 80475 | 3 | | 1985-02-02 00:00:00 | 257813 | 77248 | 4 | | 1985-02-02 00:00:00 | 98351 | 75641 | 5 | | 1985-02-02 00:00:00 | 490500 | 69491 | 6 | | 1985-02-02 00:00:00 | 51773 | 69401 | 7 | | 1985-02-02 00:00:00 | 33092 | 65665 | 8 | | 1985-02-02 00:00:00 | 430238 | 64479 | 9 | | 1985-02-02 00:00:00 | 81360 | 64371 | 10 | +---------------------+--------+--------+------------------+ Fetched 10 row(s) in 65.26s [morhidi-570-2.gce.cloudera.com:21000] > summary; +--------------+--------+----------+----------+-------+------------+----------+---------------+--------------------+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--------------+--------+----------+----------+-------+------------+----------+---------------+--------------------+ | 04:EXCHANGE | 1 | 0ns | 0ns | 10 | 10 | 0 B | -1 B | UNPARTITIONED | | 02:ANALYTIC | 3 | 8.67ms | 22.00ms | 10 | -1 | 8.18 MB | 0 B | | | 01:SORT | 3 | 21.06s | 63.17s | 2.05K | -1 | 1.16 GB | 0 B | | | 03:EXCHANGE | 3 | 15.33ms | 29.00ms | 2.84M | -1 | 0 B | 0 B | HASH(to_date(dat)) | | 00:SCAN HDFS | 3 | 177.34ms | 192.01ms | 2.84M | -1 | 33.82 MB | 64.00 MB | employees.sample | +--------------+--------+----------+----------+-------+------------+----------+---------------+--------------------+
Attachments
Issue Links
- duplicates
-
IMPALA-4055 Investigate and fix to_date() slowness
- Resolved