Apache Arrow / ARROW-12321

[R][C++] Arrow opens too many files at once when writing a dataset

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version: 3.0.0
    • Fix Version: 6.0.0
    • Components: C++, R

    Description

      Related to: https://issues.apache.org/jira/browse/ARROW-12315

      Please see https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing where I added the raw data and the output.

      This works:

      
      library(data.table)
      library(dplyr)
      library(arrow)
      
      d <- fread(
              input = "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
              colClasses = list(
                character = "Commodity Code",
                numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
              ))
      
      d <- d %>%
        mutate(
          `Reporter ISO` = case_when(
            `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
            TRUE ~ `Reporter ISO`
          ),
          `Partner ISO` = case_when(
            `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
            TRUE ~ `Partner ISO`
          )
        )
      
      # d %>%
      #   select(Year, `Reporter ISO`, `Partner ISO`) %>%
      #   distinct() %>%
      #   dim()
      
      d %>%
        group_by(Year, `Reporter ISO`) %>%
        write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
      

      But if I add an additional partitioning column and increase max_partitions to 12808 (exactly the number of partitions it needs), I get this error:

      d %>%
        group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
        write_dataset("parquet", hive_style = FALSE, max_partitions = 12808L)
      
      Error: IOError: Failed to open local file '/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'. Detail: [errno 24] Too many open files
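
      errno 24 (EMFILE) means the process reached its limit on open file descriptors, so it can help to compare the number of partition directories a write would create against that limit before calling write_dataset(). The following is only a rough sketch in base R: count_partitions() is a hypothetical helper (not part of arrow), the toy data frame is made up for illustration, and the `ulimit -n` query assumes a Unix-like shell.

```r
# Hypothetical helper (not part of arrow): count the distinct partition
# directories that partitioning on `cols` would create.
count_partitions <- function(df, cols) {
  df <- as.data.frame(df)  # also accepts a data.table such as the one fread() returns
  nrow(unique(df[, cols, drop = FALSE]))
}

# Toy stand-in for the trade data; the values are invented for illustration.
toy <- data.frame(
  Year = 2019L,
  `Reporter ISO` = c("SEN", "SEN", "CHL", "CHL", "CHL"),
  `Partner ISO` = c("MOZ", "USA", "MOZ", "USA", "USA"),
  check.names = FALSE
)

n_two <- count_partitions(toy, c("Year", "Reporter ISO"))
n_three <- count_partitions(toy, c("Year", "Reporter ISO", "Partner ISO"))
cat("partitions with two columns:", n_two, "\n")
cat("partitions with three columns:", n_three, "\n")

# Assumption: a Unix-like system where `ulimit -n` reports the soft limit on
# open file descriptors for processes started from this shell.
if (.Platform$OS.type == "unix") {
  fd_limit <- suppressWarnings(as.integer(system("ulimit -n", intern = TRUE)))
  if (length(fd_limit) == 1L && !is.na(fd_limit) && n_three > fd_limit) {
    cat("the write may need more open files than the limit of", fd_limit, "\n")
  }
}
```

      On the real data, count_partitions(d, c("Year", "Reporter ISO", "Partner ISO")) would give the figure to compare against the limit. If the writer keeps one file open per partition, as the error above suggests, a count well above the limit points to the failure reported here.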
      

          People

            Assignee: Weston Pace (westonpace)
            Reporter: Mauricio 'Pachá' Vargas Sepúlveda (pachamaltese)