[IMPALA-13260] Exchange on the probe side of outer joins could send NULL keys to local target - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Backend
Labels:
None

Epic Color:
ghx-label-9

Description

Currently NULL keys are hashed to a single value and sent to a single fragment instance in partitioned joins. This can cause data skew if the number of NULL keys is large.

If a NULL key guarantees that no row is matched on the build side, then columns from build side will be all NULL and it doesn't matter which fragment instance processes the row.

Always sending rows with NULL key to a local fragment instance would both reduce data skew and make the shuffle cheaper (no compression/network). If mt_dop>0 then to completely avoid data these rows would need to be spread evenly among the local fragment instances.

One caveat is that sending NULL keys locally would "weaken" the partitioning of the fragment, so it is no longer "partitioned by col", but "partitioned by col (with the exception of NULLs)". For example if the outer join is followed by a grouping aggregation that uses the same key, then a shuffle is still needed as the aggregation needs all NULL keys in the same fragment instance.

Attachments

Issue Links

is related to

IMPALA-13261 Consider the effect of NULL keys when choosing BROADCAST vs SHUFFLE join

Open

Activity

People

Assignee:: Unassigned

Reporter:: Csaba Ringhofer

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 30/Jul/24 07:46

Updated:: 30/Jul/24 08:06