[ARROW-7224] [C++][Dataset] Partition level filters should be able to provide filtering to file systems - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: C++
Labels:
- dataset

External issue URL:
https://github.com/apache/arrow/issues/16972

Description

When a filter is provided for partitions, it should be possible in some cases to use it to optimize file system list calls. This can greatly speed up reading data from partitioned datasets, because fewer directories/files need to be explored/expanded. I've fallen behind on the dataset code, but I want to make sure this issue is tracked somewhere. This came up in the SO question linked below (feel free to correct my analysis if I missed the functionality somewhere).

Reference: https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477
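
As a rough illustration of the idea (not Arrow's actual discovery code), the sketch below walks a hive-style `key=value` directory tree and skips any directory whose partition value is excluded by the filter, so its contents are never listed. The function name `list_pruned`, the `filters` mapping (partition key to a set of allowed string values), and the use of `os.scandir` as a stand-in for a remote list call are all assumptions made for this example.

```python
import os
import tempfile


def list_pruned(root, filters):
    """List files under `root`, pruning hive-style partition directories.

    `filters` maps a partition key to the set of allowed string values
    (hypothetical format for this sketch). Returns the sorted file paths
    and the number of directory list calls that were issued.
    """
    matches = []
    list_calls = 0
    stack = [root]
    while stack:
        current = stack.pop()
        list_calls += 1  # one "list" request per directory we expand
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir():
                    key, sep, value = entry.name.partition("=")
                    # Skip directories the filter rules out; they are never listed.
                    if sep and key in filters and value not in filters[key]:
                        continue
                    stack.append(entry.path)
                else:
                    matches.append(entry.path)
    return sorted(matches), list_calls
```

With a layout such as `year=2018|2019 / month=1|2 / part-0.parquet`, calling `list_pruned(root, {"year": {"2019"}, "month": {"1"}})` expands only the root, `year=2019`, and `year=2019/month=1`, while an empty filter expands every directory. On an object store, each avoided expansion would correspond to a list request that is not made.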

Issue Links

is related to

ARROW-11781 [Python] Reading small amount of files from a partitioned dataset is unexpectedly slow

Resolved

People

Assignee: Unassigned

Reporter: Micah Kornfield

Votes: 0

Watchers: 9

Dates

Created: 21/Nov/19 07:17

Updated: 11/Jan/23 07:52