Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Description
Currently, the conversion of DataFrames with vector columns performs poorly: it frequently runs out of memory and is slow overall.
Scenario:
- Spark DataFrame:
- 3,745,888 rows
- Each row contains one vector column, where the vector is dense and of length 65,539. Note: this is a 256x256 pixel image flattened into a row vector and appended with a 3-element one-hot encoded label; i.e., this is a forced workaround to get both the labels and features into SystemML as efficiently as possible (a rough sketch of this assembly appears after the script below).
- SystemML script + MLContext invocation (Python): simply accept the DataFrame as input and save it as a SystemML matrix in binary format. Note: I'm not grabbing the output here, so the matrix will actually be written to disk by SystemML.
script = """ write(train, "train", format="binary") """ script = dml(script).input(train=train_YX) ml.execute(script)
I'm seeing large amounts of memory being used during conversion in the mapPartitionsToPair at RDDConverterUtils.java:311 stage. For example, in one scenario it read 1493.2 GB as "Input" and performed a "Shuffle Write" of 2.5 TB. A subsequent saveAsHadoopFile at WriteSPInstruction.java:261 stage then did a "Shuffle Read" of 2.5 TB and an "Output" of 1829.1 GB. This was for a simple script that took in a DataFrame with a vector column and wrote it to disk in binary format. It kept running out of heap space, so I kept increasing the executor memory 3x until it finally ran.

Additionally, the latter stage had a very skewed execution time across partitions: ~1 hour for the first 1,000 partitions (out of 20,000), ~20 minutes for the next 18,000 partitions, and ~1 hour for the final 1,000 partitions. The passed-in DataFrame had an average of 180 rows per partition, with a max of 215 and a min of 155.
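For reference, per-partition row counts like those quoted above can be gathered with something along these lines (just a sketch; `train_YX` is the input DataFrame from the scenario above):

# Count rows in each partition of the input DataFrame, then summarize.
counts = train_YX.rdd.mapPartitions(lambda part: [sum(1 for _ in part)]).collect()
print("min:", min(counts), "avg:", sum(counts) / len(counts), "max:", max(counts))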
cc mboehm7
cc acs_s, nakul02