Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Resolved
Description
Imran is trying to read in a CSV file that encodes a 10000 x 784 matrix.
Here is what the metadata file looks like:
{
    "data_type": "matrix",
    "value_type": "double",
    "rows": 10000,
    "cols": 784,
    "format": "csv",
    "header": false,
    "sep": ",",
    "description":
}
There are two ways to read this matrix into the t-SNE algorithm as the variable X.
1. The CSV file is read into a DataFrame, and the MLContext API is then used to convert it to the binary-block format. Here is the code used to create the DataFrame:
val X0 = {
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "false")     // The file has no header row
    .option("inferSchema", "true") // Automatically infer data types
    .load("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv")
}
// code for tsne ..........
val X = X0.drop(X0.col("C784"))    // Drop last column
val tsneScript = dml(tsne).in("X", X).out("Y", "C")
println(Calendar.getInstance().get(Calendar.MINUTE))
val res = ml.execute(tsneScript)
2. A DML read statement reads the CSV file directly (a sketch of this option follows the list).
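For reference, here is a minimal sketch of option 2. It assumes tsne holds the same DML source used in option 1, and it spells out read arguments that would normally be picked up from the metadata file:

val readX =
  """X = read("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv",
    |         format="csv", header=FALSE, sep=",", rows=10000, cols=784)
    |""".stripMargin
// Prepending the read() replaces the .in("X", X) DataFrame binding,
// so no DataFrame-to-binary-block conversion is required.
val tsneScript2 = dml(readX + tsne).out("Y", "C")
val res2 = ml.execute(tsneScript2)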
Option 2 completes almost instantaneously. Option 1 takes about 8 minutes on a beefy setup: 40 GB allocated to the driver and 60 GB to the executors on a 7-node cluster.
Looking at the Spark web UI and following the code path leads us to this function:
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L874-L879
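To confirm that the DataFrame conversion, rather than t-SNE itself, dominates the runtime, here is a minimal sketch (a hypothetical identity script, using only the MLContext calls already shown above) that triggers the same conversion while doing essentially no other work:

// Binding the DataFrame X forces the same DataFrame-to-binary-block
// conversion; the script body itself is trivial.
val identityScript = dml("Y = X").in("X", X).out("Y")
val t0 = System.nanoTime()
ml.execute(identityScript)
println(s"conversion-dominated run: ${(System.nanoTime() - t0) / 1e9} s")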