  1. Apache Arrow
  2. ARROW-7224

[C++][Dataset] Partition level filters should be able to provide filtering to file systems


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • Component/s: C++

    Description

      When a filter is provided for partitions, it should be possible in some cases to use it to optimize file system list calls. This can greatly speed up reading data from partitions, because fewer directories/files need to be explored/expanded. I've fallen behind on the dataset code, but I want to make sure this issue is tracked someplace. This came up in the SO question linked below (feel free to correct my analysis if I missed the functionality someplace).

      Reference: https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477
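
To illustrate the idea, here is a minimal sketch of pruning list calls with a partition filter. It is not Arrow's API: `list_partitioned_files`, `build_demo_tree`, and the `filters` dict are hypothetical names, only equality filters on hive-style `key=value` directories are handled, and a temporary local directory stands in for a remote file system such as S3.

```python
import os
import tempfile


def list_partitioned_files(root, filters):
    """Walk a hive-style tree (key=value directories).

    `filters` maps partition key -> required value (equality only).
    Directories that contradict a filter are skipped without being listed.
    Returns the matching file paths and the number of directories listed.
    """
    matched = []
    listed_dirs = 0
    stack = [root]
    while stack:
        current = stack.pop()
        listed_dirs += 1  # one list call per directory actually visited
        for entry in os.scandir(current):
            if entry.is_dir():
                key, sep, value = entry.name.partition("=")
                # Prune: never descend into a partition that fails the filter.
                if sep and key in filters and filters[key] != value:
                    continue
                stack.append(entry.path)
            else:
                matched.append(entry.path)
    return sorted(matched), listed_dirs


def build_demo_tree(root):
    """Create year=<y>/month=<m>/part-0.parquet placeholders (empty files)."""
    for year in ("2018", "2019"):
        for month in ("1", "2", "3"):
            directory = os.path.join(root, f"year={year}", f"month={month}")
            os.makedirs(directory)
            open(os.path.join(directory, "part-0.parquet"), "w").close()


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    # No filter: every directory in the tree is listed.
    all_files, all_listed = list_partitioned_files(tmp, {})
    # Filter on both partition keys: only the matching branch is listed.
    files, listed = list_partitioned_files(tmp, {"year": "2019", "month": "2"})
```

In this toy tree the unfiltered walk lists 9 directories to find 6 files, while the filtered walk lists only 3 directories to find 1 file. On an object store, where each list call is a network round trip, that difference is what the improvement is after.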

            People

              Assignee: Unassigned
              Reporter: Micah Kornfield (emkornfield@gmail.com)
              Votes: 0
              Watchers: 9
