Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Is it expected that scanning a dataset with a filter built with `and()` is much slower than with a filter built with `and_kleene()`? Specifically, it seems that `and()` triggers a scan of the full dataset, whereas `and_kleene()` takes advantage of the fact that only one directory of the larger dataset needs to be scanned:
```
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> library(dplyr)
>
> ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = c("year", "month"))
>
> system.time({
+   out <- ds %>%
+     filter(arrow_and(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed
 46.634   4.462   6.457
>
> system.time({
+   out <- ds %>%
+     filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed
  4.633   0.421   0.754
>
```
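For context, here is my understanding of the difference between the two kernels (a minimal sketch using `arrow::call_function()`, not part of the timings above): `and_kleene()` treats `FALSE AND NULL` as `FALSE`, while `and()` propagates the null, which is presumably why a partition guarantee like `year == 2014` can simplify the Kleene version of the filter to `false` for other directories but not the non-Kleene one.

```r
library(arrow)

a <- Array$create(c(TRUE, NA, NA))
b <- Array$create(c(FALSE, FALSE, TRUE))

# Non-Kleene "and": a null in either input makes the result null.
call_function("and", a, b)         # expect: false, null, null

# Kleene "and": FALSE is absorbing, so FALSE & null is FALSE;
# only TRUE & null stays null.
call_function("and_kleene", a, b)  # expect: false, false, null
```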
I suspect that the `and()` version is scanning the whole dataset, because if I use a dataset that only contains the 2015 folder, both filters run at similar speeds:
```
> ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = c("year", "month"))
>
> system.time({
+   out <- ds %>%
+     filter(arrow_and(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed
  4.549   0.404   0.576
>
> system.time({
+   out <- ds %>%
+     filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed
  4.477   0.412   0.585
```
This does not impact anyone who uses our default collapsing mechanism in the R package, but I bumped into it with a filter that was constructed by DuckDB using `and()` instead of `and_kleene()`.
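For reference, a minimal sketch of what I mean by the default collapsing mechanism (assuming the usual arrow dplyr bindings, where comma-separated filter conditions and `&` are collapsed with `and_kleene()`), using the same dataset as above:

```r
library(arrow)
library(dplyr)

ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = c("year", "month"))

# Separate conditions (or a single `&`) are collapsed with and_kleene(),
# so the year == 2015 guarantee can prune the other partition directories:
out <- ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  collect()
```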
Issue Links
- is blocked by ARROW-12659 [C++][Compute] Support SimplifyWithGuarantee(is_null(foo), is_valid(foo)) (Resolved)