Description
Currently, broadcast join in Spark only works when:
1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 (the default is 10485760, i.e. 10 MB).
2. The size of one of the Hive tables is less than "spark.sql.autoBroadcastJoinThreshold". For Spark to get the table size from the Hive metastore, "hive.stats.autogather" must be set to true in Hive, or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" must have been run (see the illustration after this list).
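For reference, a minimal illustration of the two conditions, assuming a Spark 2.x `SparkSession` named `spark` and hypothetical table names:

```scala
// 1. The threshold must be positive (10485760 bytes = 10 MB is the default).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760L)

// 2. The metastore must hold size statistics for the small table, which requires
//    either hive.stats.autogather=true on the Hive side or an explicit ANALYZE:
spark.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")

// Only with both in place can the join below be planned as a broadcast join.
spark.sql("SELECT * FROM big_table b JOIN small_table s ON b.key = s.key").explain()
```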
Hive, by contrast, calculates the size of the files or directory backing the table to decide whether to use a map-side join, and does not depend on statistics stored in the metastore.
This leads to two problems:
1. Spark will not use a broadcast join when "hive.stats.autogather" is not set to true in Hive and the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" has not been run, because the table size has not been saved in the Hive metastore. Whether Spark can broadcast therefore depends on how Hive is configured.
2. For unrelated reasons, we set "hive.stats.autogather" to false in our Hive deployment. For the same SQL, Hive is 4 times faster than Spark, because Hive uses a map-side join while Spark does not use a broadcast join.
Is it possible for Spark to use the same mechanism as Hive and look up the size of a Hive table directly from the file system?
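A minimal sketch of what such a fallback could look like, using the Hadoop FileSystem API; the helper name and the way the table location is obtained are assumptions here, not an existing Spark API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch of the Hive-style approach: derive the table size from
// the size of its files on HDFS instead of relying on metastore statistics.
// The table location is assumed to be known, e.g. from the metastore's
// storage descriptor for the table.
def sizeOnHdfs(tableLocation: String, hadoopConf: Configuration): Long = {
  val path = new Path(tableLocation)
  val fs = path.getFileSystem(hadoopConf)
  // getContentSummary walks the directory and sums the length of all files.
  fs.getContentSummary(path).getLength
}
```

The planner could then compare this size against "spark.sql.autoBroadcastJoinThreshold" when no statistics are available in the metastore, which is the idea tracked by the linked SPARK-15365.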
Issue Links
- Blocked
  - SPARK-15365 Metastore relation should fallback to HDFS size if statistics are not available from table meta data. (Resolved)