Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
The performance experiments for our 0.11 release revealed performance issues for LinregDS and PCA (specifically for t(X)%*%X) whenever the number of columns is larger than the blocksize. For example, the following scenario shows LinregDS results for an input size of 10M x 1K with a blocksize of 1K. For scenarios with ict>0, we append a column of ones, which pushes the number of columns beyond the blocksize, and hence we compile a cpmm instead of a tsmm instruction.
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 293
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 340
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 291
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 302
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 274
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 316
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 81
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 279
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 322
In comparison, LinregCG shows much more robust experimental results:
-- Running runLinearRegCG on 10M_1k_dense (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 62
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 67
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 65
-- Running runLinearRegCG on 10M_1k_dense (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 57
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 68
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 58
-- Running runLinearRegCG on 10M_1k_dense (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 50
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 72
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 59
-- Running runLinearRegCG on 10M_1k_dense (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 57
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 67
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 67
We should introduce a new tsmm2 operation for the scenario where the excess columns fit into the broadcast memory budget, which would allow us to compute this expression without shuffling t(X) and X.
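The idea can be illustrated with a small single-process sketch. It is not the actual operator implementation: the function name tsmm2_sketch, the rows_per_block parameter, and the dictionary standing in for the broadcast are all hypothetical. The sketch assumes X is split column-wise into X1 (the first blocksize columns) and X2 (the excess columns), so that each row block can compute its partial t(X)%*%X locally from X1 plus the broadcast X2, and only the small partial results need to be summed.

```python
def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain matrix multiply on lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def tsmm2_sketch(X, blocksize, rows_per_block):
    """Compute t(X) %*% X block-wise, assuming the excess columns are broadcast."""
    n = len(X[0])
    k = min(blocksize, n)  # number of columns in the first column block
    starts = range(0, len(X), rows_per_block)

    # Hypothetical stand-in for a broadcast variable: the excess columns
    # of every row block, keyed by the block's starting row.
    broadcast = {s: [row[k:] for row in X[s:s + rows_per_block]] for s in starts}

    result = [[0] * n for _ in range(n)]
    for s in starts:
        X1 = [row[:k] for row in X[s:s + rows_per_block]]  # local block
        X2 = broadcast[s]                                   # looked up, not shuffled
        X1t, X2t = transpose(X1), transpose(X2)
        top_left = matmul(X1t, X1)       # k x k
        top_right = matmul(X1t, X2)      # k x (n - k)
        bottom_left = transpose(top_right)  # symmetric counterpart
        bottom_right = matmul(X2t, X2)   # (n - k) x (n - k)
        partial = [top_left[r] + top_right[r] for r in range(k)]
        partial += [bottom_left[r] + bottom_right[r] for r in range(n - k)]
        result = add(result, partial)    # aggregate the n x n partials
    return result

# Small integer example: 6 rows, 5 columns, blocksize 4 (one excess column).
X = [[(i * 7 + j * 3) % 5 for j in range(5)] for i in range(6)]
expected = matmul(transpose(X), X)
print(tsmm2_sketch(X, blocksize=4, rows_per_block=2) == expected)
```

The final comparison checks the block-wise result against a direct t(X)%*%X on the same small matrix. In this sketch only the n x n partial results are aggregated, which mirrors the goal of avoiding a shuffle of t(X) and X.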
Issue Links
- Is contained by: SYSTEMDS-1010 Perftest 0.11 release and related improvements (Resolved)
With this new tsmm2 operator, the end-to-end runtimes for this scenario are much smoother: