Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.12.0
-
None
Description
When tables are CLUSTERED BY (key) into multiple buckets, a query which specifies a key in the query predicate will still scan all buckets in the directory.
In the ideal scenario, only bucket needs to be inspected for a given key, particularly if hive.enforce.bucketing is turned on.
When a simple filter query like the following is run
select * from store_sales where ss_item_sk = 1;
The log files contain
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000005_0 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000006_0 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000007_0 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000008_0 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000009_0
This is going through 32x the amount of data, compared to the right approach of scanning only the partitions which match the predicate.
Attachments
Attachments
Issue Links
- duplicates
-
HIVE-5831 filter input files for bucketed tables
- Resolved