Apache Arrow / ARROW-14736

[C++][R] Opening a multi-file dataset and writing a re-partitioned version of it fails

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0.0
    • Fix Version/s: None
    • Component/s: C++, R
    • Environment: M1 Mac, macOS Monterey 12.0.1, 16 GB RAM
      R 4.1.1, {arrow} R package 6.0.0.2 (release) & 6.0.0.9000 (dev)

    Description

      Opening a multi-file dataset and writing a re-partitioned version of it fails, apparently because the data is collected into memory before being written. This happens for both wide and long data.

      Steps to reproduce the issue:
      1. Create a large dataset (100k columns, 300k rows in total) split across 20 files: write one 15,000-row file to disk, then create 20 copies of it. Each file has a footprint of roughly 7.5 GB.

      library(arrow)
      library(dplyr)
      library(fs)
      
      rows <- 300000
      cols <- 100000
      partitions <- 20
      
      wide_df <- as.data.frame(
        matrix(
          sample(1:32767, rows * cols / partitions, replace = TRUE), 
          ncol = cols)
      )
      
      schem <- sapply(colnames(wide_df), function(nm) {int16()})
      schem <- do.call(schema, schem)
      
      wide_tab <- Table$create(wide_df, schema = schem)
      
      write_parquet(wide_tab, "~/Documents/arrow_playground/wide.parquet")
      
      fs::dir_create("~/Documents/arrow_playground/wide_ds")
      for (i in seq_len(partitions)) {
        file.copy("~/Documents/arrow_playground/wide.parquet", 
                  glue::glue("~/Documents/arrow_playground/wide_ds/wide-{i-1}.parquet"))
      }
      
      ds_wide <- open_dataset("~/Documents/arrow_playground/wide_ds/")
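
The dimensions used in step 1 can be checked with a little arithmetic. The sketch below only counts raw values and assumes 2 bytes per int16 value and 4 bytes per native R integer; it ignores Parquet encoding, compression and metadata, so it is not an estimate of the on-disk footprint quoted above. The names rows_per_file, cells_per_file, int16_gib and int32_gib are illustrative and do not appear in the report.

```r
rows <- 300000
cols <- 100000
partitions <- 20

# Each file holds rows * cols / partitions values laid out in `cols` columns.
rows_per_file <- rows / partitions
cells_per_file <- rows_per_file * cols

gib <- 1024^3

# Assumed raw sizes: 2 bytes per int16 value, 4 bytes per R integer.
int16_gib <- cells_per_file * 2 / gib
int32_gib <- cells_per_file * 4 / gib

cat("rows per file: ", rows_per_file, "\n")
cat("cells per file:", cells_per_file, "\n")
cat("raw size as int16 (GiB):", round(int16_gib, 2), "\n")
cat("raw size as 32-bit integer (GiB):", round(int32_gib, 2), "\n")
```

Under these assumptions a single file's values alone take several GiB once materialised, and the full 20-file dataset is well beyond the 16 GB of RAM listed in the environment.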
      

      All the following steps fail:

      2. Creating and writing a partitioned version of ds_wide.

        ds_wide %>%
          mutate(grouper = round(V1 / 1024)) %>%
          write_dataset("~/Documents/arrow_playground/partitioned", 
                        partitioning = "grouper",
                         format = "parquet")
      

      3. Writing a non-partitioned dataset:

        ds_wide %>%
          write_dataset("~/Documents/arrow_playground/partitioned", 
                        format = "parquet")
      

      4. Creating the partitioning variable first and then attempting to write:

        ds2 <- ds_wide %>% 
          mutate(grouper = round(V1 / 1024))
      
        ds2 %>% 
          write_dataset("~/Documents/arrow_playground/partitioned", 
                        partitioning = "grouper", 
                        format = "parquet")  
      

      5. Attempting to write to CSV:

        ds_wide %>%
          write_dataset("~/Documents/arrow_playground/csv_writing/test.csv",
                        format = "csv")
      

      None of the failures appears to originate in R code, and all show the same behaviour: the R session consumes increasing amounts of RAM until it crashes.
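
For anyone who wants to exercise the same code path without tens of gigabytes of disk space, a scaled-down sketch of steps 1 and 2 follows. The dimensions (200 rows, 50 columns, 4 files), the use of tempdir(), and the names small_df, src_dir, out_dir and n_out are assumptions made for illustration and are not part of the original report. At this size the write is expected to succeed, so it only shows the shape of the pipeline, not the failure.

```r
library(arrow)
library(dplyr)

# Hypothetical small dimensions, chosen so the example runs on any machine.
rows <- 200
cols <- 50
partitions <- 4

# Same construction as in the report: each file gets rows / partitions rows.
small_df <- as.data.frame(
  matrix(sample(1:32767, rows * cols / partitions, replace = TRUE), ncol = cols)
)

# Temporary directories stand in for ~/Documents/arrow_playground.
src_dir <- file.path(tempdir(), "wide_ds")
out_dir <- file.path(tempdir(), "partitioned")
dir.create(src_dir, showWarnings = FALSE)

# Write the same table several times to get a multi-file dataset.
for (i in seq_len(partitions)) {
  write_parquet(small_df, file.path(src_dir, paste0("wide-", i - 1, ".parquet")))
}

ds <- open_dataset(src_dir)

# Step 2 of the report: derive a grouping column and write partitioned output.
ds %>%
  mutate(grouper = round(V1 / 1024)) %>%
  write_dataset(out_dir, partitioning = "grouper", format = "parquet")

# Read the result back and count rows as a sanity check.
n_out <- nrow(open_dataset(out_dir))
```

Scaling rows and cols back up towards the values in step 1 while watching the R session's memory use should show where the growth described above begins.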

      Attachments

        1. image-2021-11-17-14-55-08-597.png (55 kB, Dragoș Moldovan-Grünfeld)
        2. image-2021-11-17-14-54-42-747.png (55 kB, Dragoș Moldovan-Grünfeld)
        3. image-2021-11-17-14-43-37-127.png (63 kB, Dragoș Moldovan-Grünfeld)

            People

              Assignee: Unassigned
              Reporter: Dragoș Moldovan-Grünfeld (dragosmg)
              Votes: 0
              Watchers: 7