Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Hadoop Flags: Reviewed
Description
There are cases when the input to a Hive job consists of thousands of small files. Currently, a separate mapper is spawned for each file. Most of the overhead of spawning all these mappers can be avoided if Hive uses the CombineFileInputFormat introduced in HADOOP-4565.
Options to control this behavior:
- hive.input.format (org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is the default if empty, or org.apache.hadoop.hive.ql.io.HiveInputFormat)
- mapred.min.split.size.per.node (the minimum number of bytes needed to create a node-local split; below this, the data is combined at the rack level. Default: 0)
- mapred.min.split.size.per.rack (the minimum number of bytes needed to create a rack-local split; below this, the data is combined at the global level. Default: 0)
- mapred.max.split.size (the maximum size of each split; this limit can be exceeded slightly because a split stops accumulating data only *after* reaching it, not before)
The three size values above must be in non-descending order (see the example settings below).
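For illustration, a minimal sketch of setting these properties in a Hive session. The property names and the combining input format class are taken from the description above; the numeric values are arbitrary assumptions, not recommendations:

    -- Use the combining input format (also the default when hive.input.format is empty):
    SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

    -- Split sizing thresholds, in bytes, kept in non-descending order.
    -- Below the per-node minimum, data is combined up to the rack level:
    SET mapred.min.split.size.per.node=134217728;
    -- Below the per-rack minimum, data is combined up to the global level:
    SET mapred.min.split.size.per.rack=134217728;
    -- A split stops accumulating only after reaching this maximum, so it can be slightly exceeded:
    SET mapred.max.split.size=268435456;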
Attachments
Issue Links
- blocks
  - HIVE-826 cleanup HiveInputFormat.getRecordReader() (Open)
- is blocked by
  - HADOOP-4565 MultiFileInputSplit can use data locality information to create splits (Closed)
- is related to
  - HIVE-826 cleanup HiveInputFormat.getRecordReader() (Open)
  - HIVE-824 use same mapper for multiple directories (Open)
- relates to
  - HIVE-824 use same mapper for multiple directories (Open)