Description
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 installation as necessary):
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")
tbl <- as.DataFrame(df)
cache(tbl)  # works fine without this
cl <- collect(tbl)
identical(df, cl)  # FALSE
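To narrow down where the mismatch lies, one possible diagnostic (not part of the original report; it assumes the df and cl objects from the session above, and uses only base R) is to compare the two data frames column by column:

dim(df); dim(cl)                    # do the dimensions match?
identical(names(df), names(cl))    # do the column names match?
which(!mapply(identical, df, cl))  # indices of columns whose contents differ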
Note that identical(df, cl) returns FALSE only when cache(tbl) is called first; without the cache() call, the collected data matches the original. Although the reproduction uses SparkR, the error more likely lies in the Java / Scala Spark sources.
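For completeness, the uncached control case can be checked as below. The names tbl2 and cl2 are hypothetical (a fresh, uncached copy), and the expected TRUE reflects the report's observation that the problem only appears with cache():

tbl2 <- as.DataFrame(df)  # same data, but never cached
cl2 <- collect(tbl2)
identical(df, cl2)  # TRUE, per the report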
For posterity:
> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)