[HIVE-22993] Include Bloom Filter in Column Statistics to Better Estimate nDV - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: CBO, Statistics
Labels: None

Description

When an INSERT adds rows to a table that already contains data, Hive has no way to update the number of distinct values (nDV) for a column, because the column statistics record only the count and not the distinct values themselves.

create table test_mm(`id` int, `my_dt` date);

insert into test_mm values (1, "2018-10-01"), (2, "2018-10-01"), (3, "2018-10-01"),
(4, "2017-10-01"), (5, "2017-10-01"), (6, "2017-10-01"),
(7, "2010-10-01"), (8, "2010-10-01"), (9, "2010-10-01"),
(10, "1998-10-01"), (11, "1998-10-01"), (12, "1998-10-01");

DESCRIBE FORMATTED test_mm my_dt;
-- distinct_count: 4

insert into test_mm values (13, "2030-10-01"), (14, "2030-10-01"), (15, "2030-10-01");

DESCRIBE FORMATTED test_mm my_dt;
-- distinct_count: 4

The first INSERT statement sees that the table has 0 records, so it makes sense that every distinct value it writes is counted in the statistics. The second INSERT, however, cannot tell whether "2030-10-01" already exists in the table, so distinct_count stays at 4 even though the table now holds 5 distinct dates. If a bloom filter were stored with the column statistics, the second INSERT could test each incoming value against it. A bloom filter never reports a stored value as absent, so a negative answer for "2030-10-01" would show that it is a new distinct value, and distinct_count could be updated accordingly.
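
The idea can be illustrated with a minimal Python sketch. This is not Hive code and does not reflect how Hive stores statistics: the `BloomFilter` and `ColumnStats` classes, the `record_insert` method, and the sizing parameters are all hypothetical names chosen for illustration. The sketch assumes the filter is kept next to distinct_count and is consulted on every insert.

```python
import hashlib
import math


class BloomFilter:
    """Toy bloom filter (hypothetical; not Hive's implementation)."""

    def __init__(self, expected_items, fpp=0.01):
        # Standard sizing formulas for a target false-positive probability.
        self.num_bits = max(
            8, math.ceil(-expected_items * math.log(fpp) / (math.log(2) ** 2))
        )
        self.num_hashes = max(
            1, round(self.num_bits / expected_items * math.log(2))
        )
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value)
        )

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)


class ColumnStats:
    """Hypothetical column statistics: a distinct count plus a bloom filter."""

    def __init__(self, expected_items=1000):
        self.distinct_count = 0
        self.bloom = BloomFilter(expected_items)

    def record_insert(self, values):
        for v in values:
            # Count a value only when the filter says it was never seen.
            if not self.bloom.might_contain(v):
                self.distinct_count += 1
                self.bloom.add(v)


stats = ColumnStats()
# Values of my_dt from the first INSERT (ids 1-12).
stats.record_insert(["2018-10-01"] * 3 + ["2017-10-01"] * 3
                    + ["2010-10-01"] * 3 + ["1998-10-01"] * 3)
count_after_first = stats.distinct_count
# Values of my_dt from the second INSERT (ids 13-15).
stats.record_insert(["2030-10-01"] * 3)
count_after_second = stats.distinct_count
print(count_after_first, count_after_second)
```

Because a bloom filter can return false positives but never false negatives, a scheme like this can undercount distinct values but cannot overcount them, so the result remains an estimate.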

Issue Links

is related to: HIVE-9931 Approximate nDV statistics from ORC bloom filter population (Open)

People

Assignee: Unassigned
Reporter: David Mollitor
Votes: 0
Watchers: 2

Dates

Created: 06/Mar/20 15:59
Updated: 06/Mar/20 17:28