Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
By default in dplyr (and possibly in pandas too?) coalesces the key column for full joins to be the (non-null) values from both key columns:
> left <- tibble::tibble(
key = c(1, 2),
A = c(0, 1),
)
left_tab <- Table$create(left)
> right <- tibble::tibble(
key = c(2, 3),
B = c(0, 1),
)
right_tab <- Table$create(right)
> left %>% full_join(right)
Joining, by = "key"
# A tibble: 3 × 3
key A B
<dbl> <dbl> <dbl>
1 1 0 NA
2 2 1 0
3 3 NA 1
> left_tab %>% full_join(right_tab) %>% collect()
# A tibble: 3 × 3
key A B
<dbl> <dbl> <dbl>
1 2 1 0
2 1 0 NA
3 NA NA 1
And right join, we would expect the key from the right table to be in the result, but we get the key from the left instead:
> left <- tibble::tibble(
key = c(1, 2),
A = c(0, 1),
)
left_tab <- Table$create(left)
> right <- tibble::tibble(
key = c(2, 3),
B = c(0, 1),
)
right_tab <- Table$create(right)
> left %>% right_join(right)
Joining, by = "key"
# A tibble: 2 × 3
key A B
<dbl> <dbl> <dbl>
1 2 1 0
2 3 NA 1
> left_tab %>% right_join(right_tab) %>% collect()
# A tibble: 2 × 3
key A B
<dbl> <dbl> <dbl>
1 2 1 0
2 NA NA 1
Additionally, we should be able to keep both key columns with an option (cf https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)
Attachments
Issue Links
- blocks
-
ARROW-16897 [R][C++] Full join on Arrow objects is incorrect
- Closed
- depends upon
-
ARROW-15957 [C++] Add option to consolidate key columns in hash join
- Open
- is duplicated by
-
ARROW-16897 [R][C++] Full join on Arrow objects is incorrect
- Closed
- links to