Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Is it expected that scanning a dataset with a filter built with `and()` is much slower than with a filter built with `and_kleene()`? Specifically, it seems that `and()` triggers a scan of the full dataset, whereas `and_kleene()` takes advantage of the fact that only one directory of the larger dataset needs to be scanned:
```
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> library(dplyr)
>
> ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = c("year", "month"))
>
> system.time({
+   out <- ds %>%
+     filter(arrow_and(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed
 46.634   4.462   6.457
>
> system.time({
+   out <- ds %>%
+     filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed
  4.633   0.421   0.754
>
```
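For context, here is my understanding of the difference between the two kernels (a minimal sketch using `arrow::call_function()`, not part of the timings above): `and_kleene()` treats `FALSE AND NULL` as `FALSE`, while `and()` propagates the null, which is presumably why a partition guarantee like `year == 2014` can simplify the Kleene version of the filter to `false` for other directories but not the non-Kleene one.

```r
library(arrow)

a <- Array$create(c(TRUE, NA, NA))
b <- Array$create(c(FALSE, FALSE, TRUE))

# Non-Kleene "and": a null in either input makes the result null.
call_function("and", a, b)         # expect: false, null, null

# Kleene "and": FALSE is absorbing, so FALSE & null is FALSE;
# only TRUE & null stays null.
call_function("and_kleene", a, b)  # expect: false, false, null
```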
I suspect that the `and()` version is scanning the whole dataset, because if I use a dataset that only contains the 2015 folder, both filters run at similar speeds:
```
> ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = c("year", "month"))
>
> system.time({
+   out <- ds %>%
+     filter(arrow_and(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed
  4.549   0.404   0.576
>
> system.time({
+   out <- ds %>%
+     filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed
  4.477   0.412   0.585
```
This does not impact anyone who uses our default collapsing mechanism in the R package, but I bumped into it with a filter that was constructed by DuckDB using `and()` instead of `and_kleene()`.
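For reference, a minimal sketch of what I mean by the default collapsing mechanism (assuming the usual arrow dplyr bindings, where comma-separated filter conditions and `&` are collapsed with `and_kleene()`), using the same dataset as above:

```r
library(arrow)
library(dplyr)

ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = c("year", "month"))

# Separate conditions (or a single `&`) are collapsed with and_kleene(),
# so the year == 2015 guarantee can prune the other partition directories:
out <- ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  collect()
```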
Issue Links
- is blocked by ARROW-12659 [C++][Compute] Support SimplifyWithGuarantee(is_null(foo), is_valid(foo)) (Resolved)