Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
In some scenarios (for example, reading datasets from Amazon S3), reading columns independently and allowing unbridled Read calls to the underlying file handle can yield suboptimal performance. In such cases, it may be preferable to first read the entire serialized row group into memory and then deserialize the constituent columns from that in-memory buffer.
Note that such an option would not be appropriate as the default behavior for all file handle types, since low-selectivity reads (for example, reading only 3 columns out of a file with 100 columns) would be suboptimal in some cases. I think it would be better for "high latency" file systems to opt into this option.
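To make the trade-off concrete, here is a minimal sketch in Python that counts Read calls and bytes transferred for both strategies. It is a simulation under stated assumptions, not the Parquet reader itself: `CountingFile`, `make_row_group`, and both `read_columns_*` helpers are hypothetical names, the "row group" is just fixed-size column chunks laid end to end, and the page size is arbitrary.

```python
import io


class CountingFile:
    """Hypothetical stand-in for a high-latency file handle (not an Arrow class)."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self.read_calls = 0   # each call would cost one round trip on S3-like storage
        self.bytes_read = 0

    def read_at(self, offset: int, length: int) -> bytes:
        self.read_calls += 1
        self._buf.seek(offset)
        out = self._buf.read(length)
        self.bytes_read += len(out)
        return out


def make_row_group(num_columns: int, chunk_size: int):
    """Build a fake serialized row group and its (offset, length) column index."""
    chunks = [bytes([i % 256]) * chunk_size for i in range(num_columns)]
    offsets = [(i * chunk_size, chunk_size) for i in range(num_columns)]
    return b"".join(chunks), offsets


def read_columns_independently(f, offsets, wanted, page_size):
    """Each wanted column issues its own page-sized reads against the file handle."""
    result = {}
    for col in wanted:
        start, length = offsets[col]
        end = start + length
        parts = [
            f.read_at(pos, min(page_size, end - pos))
            for pos in range(start, end, page_size)
        ]
        result[col] = b"".join(parts)
    return result


def read_columns_prebuffered(f, offsets, wanted):
    """Read the whole row group in one call, then slice columns out of memory."""
    start = min(off for off, _ in offsets)
    end = max(off + length for off, length in offsets)
    buf = f.read_at(start, end - start)
    return {
        col: buf[offsets[col][0] - start : offsets[col][0] - start + offsets[col][1]]
        for col in wanted
    }


if __name__ == "__main__":
    data, offsets = make_row_group(num_columns=100, chunk_size=1024)
    for label, wanted in [("all 100 columns", range(100)), ("3 of 100 columns", [0, 1, 2])]:
        a, b = CountingFile(data), CountingFile(data)
        read_columns_independently(a, offsets, wanted, page_size=256)
        read_columns_prebuffered(b, offsets, wanted)
        print(label, "| independent:", a.read_calls, "calls,", a.bytes_read, "bytes",
              "| pre-buffered:", b.read_calls, "calls,", b.bytes_read, "bytes")
```

With these assumed sizes, reading every column takes 400 calls independently versus 1 call pre-buffered, for the same 102400 bytes. Reading only 3 columns takes 12 calls and 3072 bytes independently, while pre-buffering still transfers all 102400 bytes in its single call, which is why the option is better left as an opt-in.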
Issue Links
- Dependent: PARQUET-1820 [C++] Use a column filter hint to inform read prefetching in Arrow reads (Resolved)
- duplicates: PARQUET-1820 [C++] Use a column filter hint to inform read prefetching in Arrow reads (Resolved)
- is related to: ARROW-11601 [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions (Resolved)
- relates to: ARROW-8763 [C++] Create RandomAccessFile::WillNeed-like API (Resolved)
- relates to: ARROW-7995 [C++] IO: coalescing and caching read ranges (Resolved)
- links to