Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-13848

[C++] and() in a dataset filter

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • C++

    Description

      Is it expected that a scanning a dataset that has a filter built with and() is much slower than a filter built with and_kleene()? Specifically, it seems that and() triggers a scan of the full dataset, where as and_kleene() takes advantage of the fact that only one directory of the larger dataset needs to be scanned:

      > library(arrow)
      
      Attaching package: ‘arrow’
      
      The following object is masked from ‘package:utils’:
      
          timestamp
      
      > library(dplyr)
      > 
      > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = c("year", "month"))
      > 
      > system.time({
      + out <- ds %>%
      +     filter(arrow_and(total_amount > 100, year == 2015)) %>%
      +     select(tip_amount, total_amount, passenger_count) %>%
      +     collect()
      + })
         user  system elapsed 
       46.634   4.462   6.457 
      > 
      > system.time({
      + out <- ds %>%
      +     filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
      +     select(tip_amount, total_amount, passenger_count) %>%
      +     collect()
      + })
         user  system elapsed 
        4.633   0.421   0.754 
      > 
      

      I suspect that it's scanning the whole dataset because if I use a dataset that only has the 2015 folder, I get similar speeds:

      > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = c("year", "month"))
      > 
      > system.time({
      + out <- ds %>%
      +     filter(arrow_and(total_amount > 100, year == 2015)) %>%
      +     select(tip_amount, total_amount, passenger_count) %>%
      +     collect()
      + })
         user  system elapsed 
        4.549   0.404   0.576 
      > 
      > system.time({
      + out <- ds %>%
      +     filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
      +     select(tip_amount, total_amount, passenger_count) %>%
      +     collect()
      + })
         user  system elapsed 
        4.477   0.412   0.585 
      

      This does not impact anyone who uses our default collapsing mechanism in the R package, but I bumped into it with a filter that was constructed by duckdb using `and()` instead of `and_kleene()`.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jonkeane Jonathan Keane
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: