Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
3.0.0
Description
Related to: https://issues.apache.org/jira/browse/ARROW-12315
Please see https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing where I added the raw data and the output.
This works:
library(data.table)
library(dplyr)
library(arrow)

d <- fread(
  input = "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
  colClasses = list(
    character = "Commodity Code",
    numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
  )
)

d <- d %>%
  mutate(
    `Reporter ISO` = case_when(
      `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Reporter ISO`
    ),
    `Partner ISO` = case_when(
      `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Partner ISO`
    )
  )

# d %>%
#   select(Year, `Reporter ISO`, `Partner ISO`) %>%
#   distinct() %>%
#   dim()

d %>%
  group_by(Year, `Reporter ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
But if I add an additional column for partitioning and increase max_partitions to 12808 (exactly the number of partitions needed), I get this error:
d %>%
  group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 12808)

Error: IOError: Failed to open local file '/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'. Detail: [errno 24] Too many open files
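
For reference, here is a minimal sketch of how the required partition count can be checked, along the lines of the commented-out distinct() call in the working example. It assumes one partition is written per distinct combination of the grouping columns. The toy data frame and the name n_partitions are hypothetical stand-ins, not the real trade data.

```r
library(dplyr)

# Toy stand-in for the real data (hypothetical values, not from the CSV)
toy <- tibble(
  Year = c(2019L, 2019L, 2019L, 2019L),
  `Reporter ISO` = c("SEN", "SEN", "CHL", "CHL"),
  `Partner ISO` = c("MOZ", "MOZ", "PER", "ARG")
)

# Count the distinct combinations of the partitioning columns;
# this is the value that max_partitions has to accommodate
n_partitions <- toy %>%
  distinct(Year, `Reporter ISO`, `Partner ISO`) %>%
  nrow()

n_partitions
```

On the real data the same pipeline, applied to d, should give the figure passed as max_partitions above.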