Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Duplicate
Affects Version/s: 8.0.0, 9.0.0
Environment: Linux
Description
Hello,
I am trying to do a full join on a dataset. It produces the correct number of observations, but not the correct result (the resulting data.frame is just filled up with NA-rows).
My use case: I want to include the 'full' year range for every factor value:
```r
library(data.table)
library(arrow)
library(dplyr)

year_range <- 2000:2019
group_n <- 100
N <- 1000

## the resulting data should have 100 groups * 20 years
dt <- data.table(value = rnorm(N),
                 group = rep(paste0("g", 1:group_n), length.out = N))

## there are only observations for some years in every group
dt[, year := sample(year_range, size = N / group_n), by = .(group)]
dt[group == "g1", ]

## this would be the 'full' data.table
group_years <- data.table(group = rep(unique(dt$group), each = 20),
                          year = rep(year_range, times = 10))
group_years[group == "g1", ]

write_dataset(dt, path = "parquet_db")
db <- open_dataset(sources = "parquet_db")

## full_join using data.table -> expected result
db_full <- merge(dt, group_years, by = c("group", "year"), all = TRUE)
setorder(db_full, group, year)
db_full[group == "g1", ]

## try to do the full_join with arrow -> incorrect result
db_full_arrow <- db |>
  full_join(group_years, by = c("group", "year")) |>
  collect() |>
  setDT()
setorder(db_full_arrow, group, year)
db_full_arrow[group == "g1", ]

## or: convert data.table to arrow_table beforehand -> incorrect result
group_years_arrow <- group_years |> as_arrow_table()
db_full_arrow <- db |>
  full_join(group_years_arrow, by = c("group", "year")) |>
  collect() |>
  setDT()
setorder(db_full_arrow, group, year)
db_full_arrow[group == "g1", ]
```
The documentation says equality joins are supported, which I assume should also hold for `full_join`?
Thanks for your time and work!
Oliver
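For reference, the full-join result described above can be shown on a much smaller example using only base R. This is a minimal sketch, not part of the original report: the names `obs` and `grid` and all values are made up for illustration. It only shows what a correct result should look like, with the key columns `group` and `year` filled in on every row and `value` set to NA only where no observation exists.

```r
# Tiny stand-in for the reporter's data (hypothetical values):
# observations exist for only some years in each group.
obs <- data.frame(
  group = c("g1", "g1", "g2"),
  year  = c(2000L, 2002L, 2001L),
  value = c(0.5, -1.2, 0.3),
  stringsAsFactors = FALSE
)

# Every group/year combination that should appear in the result.
grid <- expand.grid(
  group = c("g1", "g2"),
  year  = 2000:2002,
  stringsAsFactors = FALSE
)

# Full join: keep all rows from both sides, matching on both key columns.
full <- merge(obs, grid, by = c("group", "year"), all = TRUE)
full <- full[order(full$group, full$year), ]

# In a correct full join the keys are never NA; only `value` is NA
# for group/year pairs without an observation.
print(full)
```

In the Arrow output described in the report, by contrast, the extra rows come back as NA rows rather than carrying their `group` and `year` values.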
Issue Links
- duplicates: ARROW-15838 [C++] Key column behavior in joins (Resolved)
- is blocked by: ARROW-15838 [C++] Key column behavior in joins (Resolved)