Description
Currently, broadcast join in Spark only works when:
1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 (the default is 10485760, i.e. 10 MB).
2. The size of one of the Hive tables is less than "spark.sql.autoBroadcastJoinThreshold". For Spark to get the table size from the Hive metastore, "hive.stats.autogather" must be set to true in Hive, or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" must have been run (see the illustration after this list).
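For reference, a minimal illustration of the two conditions, assuming a Spark 2.x `SparkSession` named `spark` and hypothetical table names:

```scala
// 1. The threshold must be positive (10485760 bytes = 10 MB is the default).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760L)

// 2. The metastore must hold size statistics for the small table, which requires
//    either hive.stats.autogather=true on the Hive side or an explicit ANALYZE:
spark.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")

// Only with both in place can the join below be planned as a broadcast join.
spark.sql("SELECT * FROM big_table b JOIN small_table s ON b.key = s.key").explain()
```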
Hive, by contrast, calculates the size of the files or directory backing the table to decide whether to use a map-side join, and does not depend on statistics stored in the metastore.
This leads to two problems:
1. Spark will not use a broadcast join when "hive.stats.autogather" is not set to true in Hive and the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" has not been run, because the table size has not been saved in the Hive metastore. Whether Spark can broadcast therefore depends on how Hive is configured.
2. For unrelated reasons, we set "hive.stats.autogather" to false in our Hive deployment. For the same SQL, Hive is 4 times faster than Spark, because Hive uses a map-side join while Spark does not use a broadcast join.
Is it possible for Spark to use the same mechanism as Hive and look up the size of a Hive table directly from the file system?
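A minimal sketch of what such a fallback could look like, using the Hadoop FileSystem API; the helper name and the way the table location is obtained are assumptions here, not an existing Spark API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch of the Hive-style approach: derive the table size from
// the size of its files on HDFS instead of relying on metastore statistics.
// The table location is assumed to be known, e.g. from the metastore's
// storage descriptor for the table.
def sizeOnHdfs(tableLocation: String, hadoopConf: Configuration): Long = {
  val path = new Path(tableLocation)
  val fs = path.getFileSystem(hadoopConf)
  // getContentSummary walks the directory and sums the length of all files.
  fs.getContentSummary(path).getLength
}
```

The planner could then compare this size against "spark.sql.autoBroadcastJoinThreshold" when no statistics are available in the metastore, which is the idea tracked by the linked SPARK-15365.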
Issue Links
- Blocked
  - SPARK-15365 Metastore relation should fallback to HDFS size if statistics are not available from table meta data. (Resolved)