[HIVE-9523] For partitioned tables same optimizations should be available as for bucketed tables and vice versa: ①[Sort Merge] PARTITION Map join and ②BUCKET pruning - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.13.0, 0.13.1, 0.14.0, 1.0.0, 1.1.0
Fix Version/s: None
Component/s: Logical Optimizer, Physical Optimizer, SQL
Labels:
- gsoc2015

Description

Logically and functionally bucketing and partitioning are quite similar - both provide mechanism to segregate and separate the table's data based on its content. Thanks to that significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible.
The difference seems to be imposed by design where the PARTITIONing is open/explicit while BUCKETing is discrete/implicit.
Partitioning seems to be very common if not a standard feature in all current RDBMS while BUCKETING seems to be HIVE specific only.
In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING".

Regardless of the fact that these two are recognised as two separate features available in Hive there should be nothing to prevent leveraging same existing query/join optimisations across the two.

①[Sort Merge] PARTITION Map join (no progress yet)
Enable Bucket Map Join or better, the Sort Merge Bucket Map Join equivalent optimisations when PARTITIONING is used exclusively or in combination with BUCKETING.

For JOIN conditions where partitioning criteria are used respectively:
⋮
FROM TabA JOIN TabB
ON TabA.partCol1 = TabB.partCol2
AND TabA.partCol2 = TabB.partCol2

the optimizer could/should choose to treat it the same way as with bucketed tables: ⋮
FROM TabC
JOIN TabD
ON TabC.clusteredByCol1 = TabD.clusteredByCol2
AND TabC.clusteredByCol2 = TabD.clusteredByCol2

and use either Bucket Map Join or better, the Sort Merge Bucket Map Join. The latter would require capability to create sorted partitions first.

This is based on fact that same way as buckets translate to separate files, the partitions essentially provide the same mapping.
When data locality is known the optimizer could focus only on joining corresponding partitions rather than whole data sets.

②BUCKET pruning (taken care by HIVE-11525)
Enable partition PRUNING equivalent optimisation for queries on BUCKETED tables

Simplest example is for queries like:
"SELECT … FROM x WHERE colA=123123"
to read only the relevant bucket file rather than all file-buckets that belong to a table.

Attachments

Issue Links

is cloned by

HIVE-12334 Partition Map Join

Open

HIVE-12336 Sort Merge Partition Map Join

Open

HIVE-12337 Sorted Partitions

Open

HIVE-11525 Bucket pruning

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Maciek Kocon

Votes:: 2 Vote for this issue

Watchers:: 9 Start watching this issue

Dates

Created:: 30/Jan/15 11:27

Updated:: 04/Nov/15 15:41