[ARROW-15838] [C++] Key column behavior in joins - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 10.0.0
Component/s: C++
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/31278

Description

By default in dplyr (and possibly in pandas too?) coalesces the key column for full joins to be the (non-null) values from both key columns:

> left <- tibble::tibble(
  key = c(1, 2),
  A = c(0, 1),                  
)                      
left_tab <- Table$create(left)
> right <- tibble::tibble(
  key = c(2, 3),
  B = c(0, 1),
)                      
right_tab <- Table$create(right)

> left %>% full_join(right) 
Joining, by = "key"
# A tibble: 3 × 3
    key     A     B
  <dbl> <dbl> <dbl>
1     1     0    NA
2     2     1     0
3     3    NA     1

> left_tab %>% full_join(right_tab) %>% collect()
# A tibble: 3 × 3
    key     A     B
  <dbl> <dbl> <dbl>
1     2     1     0
2     1     0    NA
3    NA    NA     1

And right join, we would expect the key from the right table to be in the result, but we get the key from the left instead:

> left <- tibble::tibble(
  key = c(1, 2),
  A = c(0, 1),                  
)                      
left_tab <- Table$create(left)
> right <- tibble::tibble(
  key = c(2, 3),
  B = c(0, 1),
)                      
right_tab <- Table$create(right)
> left %>% right_join(right)
Joining, by = "key"
# A tibble: 2 × 3
    key     A     B
  <dbl> <dbl> <dbl>
1     2     1     0
2     3    NA     1
> left_tab %>% right_join(right_tab) %>% collect()
# A tibble: 2 × 3
    key     A     B
  <dbl> <dbl> <dbl>
1     2     1     0
2    NA    NA     1

Additionally, we should be able to keep both key columns with an option (cf https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)

Attachments

Issue Links

blocks

ARROW-16897 [R][C++] Full join on Arrow objects is incorrect

Closed

depends upon

ARROW-15957 [C++] Add option to consolidate key columns in hash join

Open

is duplicated by

ARROW-16897 [R][C++] Full join on Arrow objects is incorrect

Closed

links to

GitHub Pull Request #14286

Activity

People

Assignee:: Will Jones

Reporter:: Will Jones

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 03/Mar/22 17:49

Updated:: 11/Jan/23 11:40

Resolved:: 14/Oct/22 20:12

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

2h 20m