[ARROW-12321] [R][C++] Arrow opens too many files at once when writing a dataset - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 6.0.0
Component/s: C++, R
Labels:
- query-engine

External issue URL:
https://github.com/apache/arrow/issues/18606

Description

Related to: https://issues.apache.org/jira/browse/ARROW-12315

Please see https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing where I added the raw data and the output.

This works:


library(data.table)
library(dplyr)
library(arrow)

d <- fread(
        input = "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
        colClasses = list(
          character = "Commodity Code",
          numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
        ))

d <- d %>%
  mutate(
    `Reporter ISO` = case_when(
      `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Reporter ISO`
    ),
    `Partner ISO` = case_when(
      `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Partner ISO`
    )
  )

# d %>%
#   select(Year, `Reporter ISO`, `Partner ISO`) %>%
#   distinct() %>%
#   dim()

d %>%
  group_by(Year, `Reporter ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
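The commented-out check in the script counts the distinct partition-key combinations, which is the number that max_partitions has to cover. A minimal base-R sketch of the same count follows; the toy data frame and the helper name count_partitions are illustrative assumptions, not part of the original report.

```r
# Toy stand-in for the trade data (assumption: the real table is far larger).
toy <- data.frame(
  Year = 2019L,
  `Reporter ISO` = c("SEN", "SEN", "CHL", "CHL", "CHL"),
  `Partner ISO` = c("MOZ", "USA", "MOZ", "USA", "USA"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct combinations of the partition keys.
# Each combination becomes one output directory when the dataset is written.
count_partitions <- function(df, cols) {
  nrow(unique(df[, cols, drop = FALSE]))
}

two_keys <- count_partitions(toy, c("Year", "Reporter ISO"))
three_keys <- count_partitions(toy, c("Year", "Reporter ISO", "Partner ISO"))
```

Each extra key can only keep or increase the count, which is why a partitioning that fits within 1024 partitions with two keys may need far more with three.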

However, if I add another partitioning column and increase max_partitions to 12808 (exactly the number of partitions required), I get this error:

d %>%
  group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 12808L)

Error: IOError: Failed to open local file '/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'. Detail: [errno 24] Too many open files
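Errno 24 means the process ran out of file descriptors, which matches the issue title: the writer holds many partition files open at once. One possible workaround is sketched below. It assumes a newer arrow release whose write_dataset() accepts a max_open_files argument; that availability is an assumption, so the code checks for it at run time. The helper names build_write_args and write_capped and the value 512 are hypothetical.

```r
# Does the installed arrow (if any) expose max_open_files? Checked at run
# time because the argument is an assumption about newer releases.
has_cap <- requireNamespace("arrow", quietly = TRUE) &&
  "max_open_files" %in% names(formals(arrow::write_dataset))

# Hypothetical helper: build the argument list for write_dataset().
build_write_args <- function(df, path, keys, open_files = 512L) {
  args <- list(
    dataset = df,
    path = path,
    format = "parquet",
    partitioning = keys,
    hive_style = FALSE,
    max_partitions = 12808L
  )
  # Only pass the cap when this arrow version accepts it.
  if (has_cap) args$max_open_files <- open_files
  args
}

# Hypothetical wrapper; calling it requires arrow to be installed.
write_capped <- function(df, path, keys, open_files = 512L) {
  do.call(arrow::write_dataset, build_write_args(df, path, keys, open_files))
}
```

A call such as write_capped(d, "parquet", c("Year", "Reporter ISO", "Partner ISO")) would then keep the number of simultaneously open files below the chosen cap where the argument is supported. On versions without it, raising the shell's descriptor limit (ulimit -n on Linux) is the other option to try.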

People

Assignee: Weston Pace

Reporter: Mauricio 'Pachá' Vargas Sepúlveda

Votes: 0

Watchers: 5

Dates

Created: 09/Apr/21 22:23

Updated: 11/Jan/23 08:25

Resolved: 20/Dec/21 17:11