Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Version: 1.7.0
Description
Background
This feature request grew out of a large search with the following characteristics:
- The search fields are not partition, bucket, or sort fields.
- The table is a very large table.
- The predicates result in very few rows compared to the scan size.
- The search columns are a significant subset of selection columns in the query.
Initial analysis showed that we could benefit significantly from reading the non-search columns lazily, only when a row matches. We explore the design and some benchmarks in the sections below.
Design
This builds on ORC-577, which currently restricts deserialization only for some selected data types and does not reduce IO.
At a high level, the design includes the following components:
- SArg to Filter: Converts the Search Arguments that are passed down into filters that can be applied efficiently during scans.
- Read: Performs the lazy read using the filters.
  - Read Filter Columns: Read the filter columns from the file.
  - Apply Filter: Apply the filter to the filter columns that were read.
  - Read Select Columns: If the filter selects at least one row, read the remaining columns.
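The three read steps above can be pictured with a small in-memory model. This is only a sketch under stated assumptions, not the ORC implementation: `lazy_read_batch`, `CountingReader`, and the dictionary-of-lists batch layout are hypothetical names and structures chosen for illustration.

```python
from typing import Callable, Dict, List, Sequence

Column = List[object]


class CountingReader:
    """Hypothetical column source that records which columns were read."""

    def __init__(self, table: Dict[str, Column]) -> None:
        self.table = table
        self.reads: List[str] = []

    def __call__(self, name: str) -> Column:
        self.reads.append(name)  # stands in for one IO request
        return self.table[name]


def lazy_read_batch(
    read_column: Callable[[str], Column],
    filter_columns: Sequence[str],
    select_columns: Sequence[str],
    row_filter: Callable[[Dict[str, object]], bool],
) -> Dict[str, Column]:
    """Return the matching rows of one batch, reading other columns lazily."""
    if not filter_columns:
        raise ValueError("at least one filter column is required")

    # Step 1: read only the filter columns.
    filter_data = {name: read_column(name) for name in filter_columns}
    row_count = len(filter_data[filter_columns[0]])

    # Step 2: apply the filter using the filter columns alone.
    selected = [
        i for i in range(row_count)
        if row_filter({name: col[i] for name, col in filter_data.items()})
    ]

    # Step 3: with no matching row, the remaining columns are never read.
    if not selected:
        return {name: [] for name in select_columns}

    result: Dict[str, Column] = {}
    for name in select_columns:
        column = filter_data[name] if name in filter_data else read_column(name)
        result[name] = [column[i] for i in selected]
    return result
```

In this model, a batch with no matching row costs one read per filter column, while a batch with at least one match costs a second round of reads for the remaining select columns.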
This issue has the following tasks, which provide further details on the design of the respective components:
- ORC-741: Bug fix related to schema evolution of missing columns in the presence of filters
- ORC-742: LazyIO of non-filter columns
- ORC-743: Conversion of SArg to Filter
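To make the SArg-to-filter idea concrete, the sketch below compiles a simplified search argument, an OR of IN clauses like the one used in the tests, into a per-row predicate plus the list of columns it needs. The tuple-based representation and the name `or_of_in_clauses` are assumptions made for this example; they are not ORC's SearchArgument API, and SQL NULL semantics are ignored.

```python
from typing import Callable, Dict, Iterable, List, Tuple

RowFilter = Callable[[Dict[str, object]], bool]


def or_of_in_clauses(
    clauses: Iterable[Tuple[str, Iterable[object]]],
) -> Tuple[List[str], RowFilter]:
    """Compile `col1 IN (...) OR col2 IN (...)` into (filter columns, predicate)."""
    # Build each value set once so that every row test is a hash lookup.
    compiled = [(column, frozenset(values)) for column, values in clauses]
    filter_columns = sorted({column for column, _ in compiled})

    def row_filter(row: Dict[str, object]) -> bool:
        # OR semantics: one matching clause is enough to select the row.
        return any(row[column] in values for column, values in compiled)

    return filter_columns, row_filter
```

The returned column list tells the reader which columns must be read before the predicate can run; all other select columns can wait until a row matches.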
Tests
We evaluated this approach against a search job with the following stats:
- Table
  - Size: ~420 TB
  - Data fields: ~120
  - Partition fields: 3
- Scan
  - Search fields: 3 data fields with large (~1000 values) IN clauses combined by OR
  - Select fields: 16 data fields (including the 3 search fields) and 1 partition field
  - Search:
    - Size: ~180 TB
    - Records: 3.99 T
  - Selected:
    - Size: ~100 MB
    - Records: 1 M
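The selectivity implied by these stats can be worked out directly. The short calculation below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes, T = 10^12, M = 10^6), which the issue does not state, and treats the approximate figures as exact.

```python
TB = 10**12
MB = 10**6

scanned_bytes = 180 * TB
scanned_records = 3_990_000_000_000  # 3.99 T
selected_bytes = 100 * MB
selected_records = 1_000_000  # 1 M

# Fraction of scanned records and bytes that survive the filter.
record_selectivity = selected_records / scanned_records
byte_selectivity = selected_bytes / scanned_bytes

print(f"record selectivity: {record_selectivity:.2e}")
print(f"byte selectivity:   {byte_selectivity:.2e}")
```

Under these assumptions, fewer than one record in a million is selected, which is the regime where skipping the non-search columns is most likely to pay off.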
We observed the following reductions compared with runs without the patch:
Test | IO Reduction % | CPU Reduction % |
---|---|---|
Select 16 columns | 45 | 47 |
SELECT * | 70 | 87 |
- The savings grow as the number of select columns increases relative to the number of search columns.
- When the filter selects most of the data, we observed no significant penalty from issuing two IOs instead of one.
- There is a penalty from applying the filter to the selected records.
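The trend in the first point can be illustrated with an idealized bound. The model below is an assumption for illustration, not the authors' analysis: it supposes that all columns are the same size on disk and that no batch contains a matching row, so only the filter columns are ever read. The function name `max_io_reduction` is hypothetical.

```python
def max_io_reduction(filter_columns: int, select_columns: int) -> float:
    """Upper bound on the IO saved when only the filter columns are read.

    Assumes equal-sized columns and no matching rows; real savings are lower.
    """
    if not 0 < filter_columns <= select_columns:
        raise ValueError("need 0 < filter_columns <= select_columns")
    return 1 - filter_columns / select_columns


# 3 search fields out of 16 selected data fields.
print(f"16 columns: {max_io_reduction(3, 16):.1%}")
# 3 search fields out of ~120 data fields (SELECT *).
print(f"SELECT *:   {max_io_reduction(3, 120):.1%}")
```

The bound rises with the number of select columns, matching the direction of the measured results. The measured 45% and 70% sit below it, as expected when columns differ in size and some batches do contain matches.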
Issue Links
- is depended upon by
  - ORC-980 Filter processing ignores the schema case-sensitivity flag (Closed)
  - ORC-983 Revise filter processing log level/location/message (Closed)
- is related to
  - ORC-577 Allow row-level filtering (Closed)
  - SPARK-41798 Upgrade hive-storage-api to 2.8.1 (Resolved)