Details
- Type: Umbrella
- Status: Resolved
- Priority: Major
- Resolution: Done
- Fix Version: 3.0.0
Description
Arrow 0.12.0 is released and it contains the R API. We could optimize Spark DataFrame <> R DataFrame interoperability.
For instance, see the examples below:
- dapply
df <- createDataFrame(mtcars)
collect(dapply(df,
               function(r.data.frame) {
                 data.frame(r.data.frame$gear)
               },
               structType("gear long")))
- gapply
df <- createDataFrame(mtcars)
collect(gapply(df,
               "gear",
               function(key, group) {
                 data.frame(gear = key[[1]], disp = mean(group$disp) > group$disp)
               },
               structType("gear double, disp boolean")))
- R DataFrame -> Spark DataFrame
createDataFrame(mtcars)
- Spark DataFrame -> R DataFrame
collect(df)
head(df)
Currently, some of the communication paths between the R side and the JVM side have to buffer the data and flush it at once due to ARROW-4512. I don't target fixing that under this umbrella.
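As a usage sketch of how these optimized paths would be switched on: the related ticket SPARK-27834 introduces a separate SparkR vectorization configuration, so, assuming that flag (`spark.sql.execution.arrow.sparkr.enabled`; verify the exact name against your Spark version), it could be set when starting the session:

```r
library(SparkR)

# Start a local session with the Arrow-based SparkR vectorization enabled.
# Config name assumed per SPARK-27834; it only takes effect when the
# arrow R package is installed on the driver.
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

df <- createDataFrame(mtcars)  # R DataFrame -> Spark DataFrame via Arrow
head(collect(df))              # Spark DataFrame -> R DataFrame via Arrow
```

With the flag disabled (the default in this scenario), the same calls fall back to the existing row-by-row serialization path.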
Issue Links
- is related to: SPARK-27834 Make separate PySpark/SparkR vectorization configurations (Resolved)
- relates to: SPARK-21187 Complete support for remaining Spark data types in Arrow Converters (Resolved)