[PARQUET-1698] [C++] Add reader option to pre-buffer entire serialized row group into memory - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: cpp-4.0.0
Component/s: parquet-cpp
Labels:
- pull-request-available

Description

In some scenarios (for example, reading datasets from Amazon S3), reading columns independently and allowing unbridled Read calls to the underlying file handle can yield suboptimal performance. In such cases, it may be preferable to first read the entire serialized row group into memory and then deserialize the constituent columns from that in-memory buffer.

Note that such an option would not be appropriate as a default behavior for all file handle types, since low-selectivity reads (for example, reading only 3 columns out of a file with 100 columns) would be suboptimal in some cases. I think it would be better for "high latency" file systems to opt in to this behavior.
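
The trade-off described above can be made concrete with a small, self-contained sketch. This is not the parquet-cpp API: `CountingFile`, `ColumnChunk`, `MakeUniformLayout`, `ReadColumnsIndependently`, and `ReadColumnsPreBuffered` are hypothetical names invented for illustration, and the sketch assumes the column chunks of a row group are sorted by offset and laid out contiguously. It only counts Read calls and bytes fetched under each strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical byte range of one column chunk inside a serialized row group.
struct ColumnChunk {
  std::size_t offset;
  std::size_t length;
};

// Hypothetical stand-in for a high-latency file handle (e.g. an object store).
// It counts Read calls and bytes fetched so the strategies can be compared.
class CountingFile {
 public:
  explicit CountingFile(std::vector<std::uint8_t> data) : data_(std::move(data)) {}

  std::vector<std::uint8_t> Read(std::size_t offset, std::size_t length) {
    assert(offset + length <= data_.size());
    ++read_calls;
    bytes_read += length;
    const auto first = data_.begin() + static_cast<std::ptrdiff_t>(offset);
    return std::vector<std::uint8_t>(first, first + static_cast<std::ptrdiff_t>(length));
  }

  std::size_t read_calls = 0;
  std::size_t bytes_read = 0;

 private:
  std::vector<std::uint8_t> data_;
};

// Builds a row group layout of equally sized, contiguous column chunks.
std::vector<ColumnChunk> MakeUniformLayout(std::size_t num_columns, std::size_t chunk_size) {
  std::vector<ColumnChunk> chunks;
  for (std::size_t i = 0; i < num_columns; ++i) {
    chunks.push_back({i * chunk_size, chunk_size});
  }
  return chunks;
}

// Strategy 1: one Read call per selected column chunk.
std::vector<std::vector<std::uint8_t>> ReadColumnsIndependently(
    CountingFile& file, const std::vector<ColumnChunk>& chunks,
    const std::vector<std::size_t>& selected) {
  std::vector<std::vector<std::uint8_t>> out;
  for (std::size_t i : selected) {
    out.push_back(file.Read(chunks[i].offset, chunks[i].length));
  }
  return out;
}

// Strategy 2: a single Read call for the whole row group, then slice the
// selected column chunks out of the in-memory buffer.
std::vector<std::vector<std::uint8_t>> ReadColumnsPreBuffered(
    CountingFile& file, const std::vector<ColumnChunk>& chunks,
    const std::vector<std::size_t>& selected) {
  assert(!chunks.empty());
  const std::size_t begin = chunks.front().offset;
  const std::size_t end = chunks.back().offset + chunks.back().length;
  const std::vector<std::uint8_t> buffer = file.Read(begin, end - begin);
  std::vector<std::vector<std::uint8_t>> out;
  for (std::size_t i : selected) {
    const auto first = buffer.begin() + static_cast<std::ptrdiff_t>(chunks[i].offset - begin);
    out.emplace_back(first, first + static_cast<std::ptrdiff_t>(chunks[i].length));
  }
  return out;
}
```

With 100 equally sized column chunks, reading every column costs 100 Read calls independently versus 1 when pre-buffered, for the same number of bytes. Reading only 3 of the 100 columns costs 3 Read calls and 3% of the bytes independently, while pre-buffering still fetches the whole row group in 1 call. Whether the saved round trips outweigh the extra bytes depends on the latency of the file system, which is why the sketch treats pre-buffering as something a caller chooses rather than a default.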

cc fsaintjacques bkietz apitrou

Issue Links

Dependent

PARQUET-1820 [C++] Use a column filter hint to inform read prefetching in Arrow reads

Resolved

duplicates

PARQUET-1820 [C++] Use a column filter hint to inform read prefetching in Arrow reads

Resolved

is related to

ARROW-11601 [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions

Resolved

relates to

ARROW-8763 [C++] Create RandomAccessFile::WillNeed-like API

Resolved

ARROW-7995 [C++] IO: coalescing and caching read ranges

Resolved

links to

GitHub Pull Request #6138

People

Assignee: David Li

Reporter: Wes McKinney

Votes: 0

Watchers: 6

Dates

Created: 21/Nov/19 04:15

Updated: 23/Jun/24 03:31

Resolved: 24/Mar/21 16:09
