SystemDS / SYSTEMDS-1004

New spark tsmm2 matrix multiplication operator

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • SystemML 0.11

    Description

      The performance experiments for our 0.11 release revealed performance issues for LinregDS and PCA (specifically for t(X)%*%X) whenever the number of columns is larger than the blocksize. For example, the following scenario shows LinregDS results for an input size of 10M x 1K with a blocksize of 1K. For scenarios with an intercept (ict>0 in the runs below), we append a column of ones, so the number of columns exceeds the blocksize and we compile a cpmm instead of a tsmm instruction.

      -- Running runLinearRegDS on 10M_1k_dense (all configs)
      LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
      LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 293
      LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 340
      -- Running runLinearRegDS on 10M_1k_dense (all configs)
      LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
      LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 291
      LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 302
      -- Running runLinearRegDS on 10M_1k_dense (all configs)
      LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
      LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 274
      LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 316
      -- Running runLinearRegDS on 10M_1k_dense (all configs)
      LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 81
      LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 279
      LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 322
      

      In comparison, LinregCG results are stable across all intercept settings:

      -- Running runLinearRegCG on 10M_1k_dense (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 62
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 67
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 65
      -- Running runLinearRegCG on 10M_1k_dense (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 57
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 68
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 58
      -- Running runLinearRegCG on 10M_1k_dense (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 50
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 72
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 59
      -- Running runLinearRegCG on 10M_1k_dense (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 57
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 67
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 67
      

      We should introduce a new tsmm2 operation for the scenario where the excess columns fit into the broadcast memory budget, which would allow us to compute this expression without shuffling t(X) and X.
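
      The idea can be illustrated with a minimal single-process sketch. This is not the SystemML implementation: the function names (transpose_times, tsmm2_sketch), the rows_per_block parameter, and the dictionary standing in for a broadcast variable are all hypothetical. The sketch assumes X is split column-wise into X1 (the first blocksize columns) and X2 (the excess columns), and that X2 is small enough to be made available to every task. Each row block then computes t(X1)%*%X1, t(X1)%*%X2, and t(X2)%*%X2 locally, and the partial results are summed, so no join between t(X) and X is needed.

```python
def transpose_times(A, B):
    """Return t(A) %*% B for row-major lists of lists with equal row counts."""
    out = [[0.0] * len(B[0]) for _ in range(len(A[0]))]
    for row_a, row_b in zip(A, B):
        for i, a in enumerate(row_a):
            for j, b in enumerate(row_b):
                out[i][j] += a * b
    return out


def tsmm2_sketch(X, blocksize, rows_per_block):
    """Compute t(X) %*% X when X has more columns than blocksize.

    Hypothetical illustration only: the dict below stands in for a
    broadcast of the excess columns, keyed by row-block index.
    """
    ncol = len(X[0])
    excess = ncol - blocksize
    assert excess > 0, "sketch targets the case ncol > blocksize"

    starts = range(0, len(X), rows_per_block)
    # Stand-in for the broadcast: excess columns of every row block.
    broadcast_x2 = {
        k: [row[blocksize:] for row in X[s:s + rows_per_block]]
        for k, s in enumerate(starts)
    }

    result = [[0.0] * ncol for _ in range(ncol)]
    for k, s in enumerate(starts):
        # Local row block restricted to the first `blocksize` columns.
        x1 = [row[:blocksize] for row in X[s:s + rows_per_block]]
        x2 = broadcast_x2[k]  # lookup replaces a shuffle-based join

        a11 = transpose_times(x1, x1)
        a12 = transpose_times(x1, x2)
        a22 = transpose_times(x2, x2)

        # Accumulate the partial blocks; the lower-left block is t(a12).
        for i in range(blocksize):
            for j in range(blocksize):
                result[i][j] += a11[i][j]
            for j in range(excess):
                result[i][blocksize + j] += a12[i][j]
                result[blocksize + j][i] += a12[i][j]
        for i in range(excess):
            for j in range(excess):
                result[blocksize + i][blocksize + j] += a22[i][j]
    return result


# Two feature columns plus an appended column of ones, blocksize 2.
X = [[1, 2, 1], [3, 4, 1], [5, 6, 1], [7, 8, 1]]
assert tsmm2_sketch(X, blocksize=2, rows_per_block=2) == transpose_times(X, X)
```

      In the sketch, only the per-block partial results need to be combined at the end, which mirrors the aggregation a tsmm-style operator performs.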

          Activity

            mboehm7 Matthias Boehm added a comment -

            With this new tsmm2 operator, the end-to-end runtimes for this scenario are much smoother:

            -- Running runLinearRegDS on 10M_1k_dense (all configs)
            LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 81
            LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 90
            LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 93
            -- Running runLinearRegDS on 10M_1k_dense (all configs)
            LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
            LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 89
            LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 93
            -- Running runLinearRegDS on 10M_1k_dense (all configs)
            LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 81
            LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 89
            LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 93
            -- Running runLinearRegDS on 10M_1k_dense (all configs)
            LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 81
            LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 94
            LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 94
            

            People

              mboehm7 Matthias Boehm
              mboehm7 Matthias Boehm