[SPARK-23839] consider bucket join in cost-based JoinReorder rule - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

The cost-based JoinReorder rule has been available since Spark 2.2, and Spark 2.3 improved it with histograms. However, it does not take into account the cost of the different join implementations. For example:

TableA JOIN TableB JOIN TableC

TableA will output 10,000 rows after filter and projection.

TableB will output 10,000 rows after filter and projection.

TableC will output 8,000 rows after filter and projection.

The current JoinReorder rule may optimize the plan to join TableC with TableA first and then TableB. But if TableA and TableB are bucketed tables and a bucket join can be applied to them, it could be a different story.

Also, to support bucket join of more than 2 tables when one table's bucket number is a multiple of another's (~~SPARK-17570~~), whether bucket join can take effect depends on the result of JoinReorder. For example, for "A join B join C" with bucket numbers 8, 4, 12, the JoinReorder rule should optimize the order to "A join B join C" so that the bucket join takes effect, instead of "C join A join B".
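To make the 8, 4, 12 example concrete, here is a minimal sketch in plain Python (not Spark code). It assumes all three tables are bucketed on the same join key, that two sides can be joined without a shuffle when one bucket count is a multiple of the other (the SPARK-17570 idea), that a shuffle-free join keeps the smaller bucket count, and that an incompatible right side is shuffled to match the left. The function names are hypothetical and exist only for illustration.

```python
def bucket_compatible(m: int, n: int) -> bool:
    """True if one bucket count is a multiple of the other."""
    return m % n == 0 or n % m == 0


def shuffle_free_joins(order, buckets):
    """Count the joins in a left-deep order that need no shuffle (toy model)."""
    current = buckets[order[0]]  # partition count of the running join result
    free = 0
    for name in order[1:]:
        n = buckets[name]
        if bucket_compatible(current, n):
            free += 1
            # Assumption: the finer side is coalesced to the coarser one.
            current = min(current, n)
        # Otherwise (assumption): the right side is shuffled to `current`
        # partitions, so the partition count of the result does not change.
    return free


buckets = {"A": 8, "B": 4, "C": 12}
for order in (["A", "B", "C"], ["C", "A", "B"]):
    print(" join ".join(order), "->", shuffle_free_joins(order, buckets))
```

Under these assumptions, "A join B join C" keeps more joins shuffle-free than "C join A join B", because 12 and 8 are not multiples of each other and the first join of the second order has to shuffle.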

Based on the current CBO JoinReorder, there are possibly 2 parts to be changed:

- The CostBasedJoinReorder rule is applied in the optimizer phase, while join selection happens in the planner phase and the bucket join optimization happens in EnsureRequirements, which is in the preparation phase. Both come after the optimizer.
- The current statistics and join cost formula are based on data selectivity and cardinality. We need to add statistics that represent the cost of the join method, such as shuffle, sort, and hash, and then add those statistics into the formula used to estimate the join cost.
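The second point can be sketched with a toy cost function in plain Python. This is not Spark's actual cost formula: `JoinStep`, `plan_cost`, and `shuffle_weight` are hypothetical names, and the intermediate and final row counts (6,000, 7,000, 5,000) are invented for illustration. Only the 10,000 / 10,000 / 8,000 input sizes come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JoinStep:
    output_rows: int    # estimated rows produced by this join
    shuffled_rows: int  # estimated input rows that must be shuffled


def plan_cost(steps, shuffle_weight=0.0):
    """Toy cost: output cardinality plus a weighted shuffle term per join."""
    return sum(s.output_rows + shuffle_weight * s.shuffled_rows for s in steps)


# (C join A) join B: no usable bucketing, so both inputs of each join shuffle.
c_a_then_b = [
    JoinStep(output_rows=6_000, shuffled_rows=8_000 + 10_000),
    JoinStep(output_rows=5_000, shuffled_rows=6_000 + 10_000),
]

# (A join B) join C: A and B are bucketed, so only C is shuffled (assumption).
a_b_then_c = [
    JoinStep(output_rows=7_000, shuffled_rows=0),
    JoinStep(output_rows=5_000, shuffled_rows=8_000),
]

for weight in (0.0, 1.0):
    print(weight, plan_cost(c_a_then_b, weight), plan_cost(a_b_then_c, weight))
```

With these assumed numbers, the cardinality-only cost (`shuffle_weight=0.0`) prefers joining TableC with TableA first, while charging for shuffled rows flips the choice to joining the two bucketed tables first.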

Issue Links

relates to

SPARK-17570 Avoid Hash and Exchange in Sort Merge join if bucketing factor is multiple for tables

Resolved

SPARK-16026 Cost-based Optimizer Framework

Resolved

SPARK-21975 Histogram support in cost-based optimizer

Resolved

People

Assignee: Unassigned

Reporter: Xiaoju Wu

Votes: 1

Watchers: 8

Dates

Created: 01/Apr/18 09:16

Updated: 12/Dec/22 18:10

Resolved: 08/Oct/19 05:41