[ORC-744] LazyIO of non-filter columns - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.7.0
Fix Version/s: 1.7.0
Component/s: Reader
Labels:
- releasenotes

Description

Background

This feature request started as a result of a large search that is performed with the following characteristics:

The search fields are not part of partition, bucket or sort fields.
The table is a very large table.
The predicates result in very few rows compared to the scan size.
The search columns are a significant subset of selection columns in the query.

Initial analysis showed that we could have a significant benefit by lazily reading the non-search columns only when we have a match. We explore the design and some benchmarks in subsequent sections.

Design

This builds further on ~~ORC-577~~ which currently only restricts deserialization for some selected data types but does not improve on IO.

On a high level the design includes the following components:

SArg to Filter: Converts Search Arguments passed down into filters for efficient application during scans.
Read: Performs the lazy read using the filters.
- Read Filter Columns: Read the filter columns from the file.
- Apply Filter: Apply the filter on the read filter columns.
- Read Select Columns: If filter selects at least a row then read the remaining columns.

This issue has the following tasks that provides further details on the design of the respective components:

~~ORC-741~~: Bug fix related to schema evolution of missing columns in the presence of filters
~~ORC-742~~: LazyIO of non-filter columns
~~ORC-743~~: Conversion of SArg to Filter

Tests

We evaluated this approach against a search job with the following stats:

Table
- Size: ~420 TB
- Data fields: ~120
- Partition fields: 3
Scan
- Search fields: 3 data fields with large (~ 1000 value) IN clauses compounded by OR.
- Select fields: 16 data fields (includes the 3 search fields), 1 partition field
- Search:
  - Size: ~180 TB
  - Records: 3.99 T
- Selected:
  - Size: ~100 MB
  - Records: 1 M

We have observed the following reductions compared with the absence of the patch:

Test	IO Reduction %	CPU Reduction %
Select 16 columns	45	47
SELECT *	70	87

The savings are more significant as you increase the number of select columns with respect to the search columns
When the filter selects most data, no significant penalty observed as a result of 2 IO compared with a single IO
- We do have a penalty as a result of the filter application on the selected records.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

image-2021-01-25-14-34-45-375.png
25/Jan/21 22:34
40 kB
Pavan Lanka

Issue Links

is depended upon by

ORC-980 Filter processing ignores the schema case-sensitivity flag

Closed

ORC-983 Revise filter processing log level/location/message

Closed

is related to

ORC-577 Allow row-level filtering

Closed

SPARK-41798 Upgrade hive-storage-api to 2.8.1

Resolved

Sub-Tasks

There are no Sub-Tasks for this issue.

Activity

People

Assignee:: Pavan Lanka

Reporter:: Pavan Lanka

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 25/Jan/21 22:27

Updated:: 31/Dec/22 05:12

Resolved:: 01/Sep/21 04:25