Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
Description
All algorithms that support the 'intercept' option (e.g., LinregCG, LinregDS, L2SVM, MSVM, Mlogreg, and GLM) append a column of 1s in the beginning of the script. On large sparse data, this append sometimes dominates end-to-end performance. For example, here are the LinregCG results for a 10Mx1K scenario with sparsity 0.01.
-- Running runLinearRegCG on 10M_1k_sparse (all configs) LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7 LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15 LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15 -- Running runLinearRegCG on 10M_1k_sparse (all configs) LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7 LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15 LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15 -- Running runLinearRegCG on 10M_1k_sparse (all configs) LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 6 LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15 LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 16 -- Running runLinearRegCG on 10M_1k_sparse (all configs) LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7 LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 16 LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
and here is the related -stats output for ict=1.
Total elapsed time: 16.893 sec. Total compilation time: 2.412 sec. Total execution time: 14.480 sec. Number of compiled Spark inst: 0. Number of executed Spark inst: 0. Cache hits (Mem, WB, FS, HDFS): 172/0/0/2. Cache writes (WB, FS, HDFS): 77/0/1. Cache times (ACQr/m, RLS, EXP): 1.734/0.003/2.143/0.209 sec. HOP DAGs recompiled (PRED, SB): 0/0. HOP DAGs recompile time: 0.000 sec. Spark ctx create time (lazy): 0.000 sec. Spark trans counts (par,bc,col):0/0/0. Spark trans times (par,bc,col): 0.000/0.000/0.000 secs. Total JIT compile time: 5.357 sec. Total JVM GC count: 2. Total JVM GC time: 5.628 sec. Heavy hitter instructions (name, time, count): -- 1) append 8.595 sec 26 -- 2) mmchain 4.443 sec 8 -- 3) ba+* 0.537 sec 10 -- 4) r' 0.411 sec 10 -- 5) write 0.210 sec 1 -- 6) - 0.087 sec 20 -- 7) uak+ 0.059 sec 2 -- 8) tsmm 0.049 sec 11 -- 9) rand 0.043 sec 5 -- 10) +* 0.007 sec 24
The large GC time indicates that sparse row re-allocations are a major issue here. We should compute the joint nnz per output row, and allocate the output sparse row just once.
Attachments
Issue Links
- Is contained by
-
SYSTEMDS-1010 Perftest 0.11 release and related improvements
- Resolved
- is related to
-
SYSTEMDS-1012 Multi-threaded append operations
- Open
with an allocate-once policy we see small end-to-end improvements (as shown below), but especially for sparse, we should execute these append operations multi-threaded (SYSTEMML-1012), which has the potential of mitigating allocation inefficiencies.