SYSTEMDS-1011

Slow sparse append cbind (sparse row re-allocations)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: SystemML 0.11
    • Component/s: None
    • Labels: None

    Description

      All algorithms that support the 'intercept' option (e.g., LinregCG, LinregDS, L2SVM, MSVM, Mlogreg, and GLM) append a column of 1s at the beginning of the script. On large sparse data, this append sometimes dominates end-to-end performance. For example, here are the LinregCG results for a 10Mx1K scenario with sparsity 0.01.

      -- Running runLinearRegCG on 10M_1k_sparse (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
      -- Running runLinearRegCG on 10M_1k_sparse (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
      -- Running runLinearRegCG on 10M_1k_sparse (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 6
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 16
      -- Running runLinearRegCG on 10M_1k_sparse (all configs)
      LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
      LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 16
      LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
      

      and here is the related -stats output for ict=1.

      Total elapsed time:		16.893 sec.
      Total compilation time:		2.412 sec.
      Total execution time:		14.480 sec.
      Number of compiled Spark inst:	0.
      Number of executed Spark inst:	0.
      Cache hits (Mem, WB, FS, HDFS):	172/0/0/2.
      Cache writes (WB, FS, HDFS):	77/0/1.
      Cache times (ACQr/m, RLS, EXP):	1.734/0.003/2.143/0.209 sec.
      HOP DAGs recompiled (PRED, SB):	0/0.
      HOP DAGs recompile time:	0.000 sec.
      Spark ctx create time (lazy):	0.000 sec.
      Spark trans counts (par,bc,col):0/0/0.
      Spark trans times (par,bc,col):	0.000/0.000/0.000 secs.
      Total JIT compile time:		5.357 sec.
      Total JVM GC count:		2.
      Total JVM GC time:		5.628 sec.
      Heavy hitter instructions (name, time, count):
      -- 1) 	append 	8.595 sec 	26
      -- 2) 	mmchain 	4.443 sec 	8
      -- 3) 	ba+* 	0.537 sec 	10
      -- 4) 	r' 	0.411 sec 	10
      -- 5) 	write 	0.210 sec 	1
      -- 6) 	- 	0.087 sec 	20
      -- 7) 	uak+ 	0.059 sec 	2
      -- 8) 	tsmm 	0.049 sec 	11
      -- 9) 	rand 	0.043 sec 	5
      -- 10) 	+* 	0.007 sec 	24
      

      The large GC time indicates that sparse row re-allocations are a major issue here. We should compute the joint nnz (number of non-zeros) per output row and allocate each output sparse row just once; a minimal sketch follows.
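
      As an illustrative sketch of this allocate-once idea (the SparseRowLike and CbindSketch classes below are simplified stand-ins, not SystemML's actual SparseRow API):

      class SparseRowLike {
          final int[] indexes;   // column indexes of non-zeros
          final double[] values; // corresponding values
          int size;              // number of non-zeros currently stored

          SparseRowLike(int capacity) {
              // allocated exactly once, sized to the known joint nnz
              indexes = new int[capacity];
              values = new double[capacity];
          }

          void append(int col, double val) {
              // no resize checks needed: capacity was pre-computed
              indexes[size] = col;
              values[size] = val;
              size++;
          }
      }

      class CbindSketch {
          // cbind two sparse rows: size the output to the joint nnz up front
          static SparseRowLike cbindRow(SparseRowLike left, SparseRowLike right, int leftCols) {
              SparseRowLike out = new SparseRowLike(left.size + right.size);
              for (int i = 0; i < left.size; i++)
                  out.append(left.indexes[i], left.values[i]);
              for (int i = 0; i < right.size; i++)
                  out.append(leftCols + right.indexes[i], right.values[i]); // shift columns by leftCols
              return out;
          }
      }

      Compared to growing each output row incrementally (start small, double the arrays on demand, copy the old contents), sizing the row to the joint nnz up front avoids the intermediate arrays that would otherwise accumulate as garbage.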

          Activity

            mboehm7 Matthias Boehm added a comment -

            With an allocate-once policy we see small end-to-end improvements (as shown below), but especially for sparse inputs we should execute these append operations multi-threaded (SYSTEMML-1012), which has the potential to mitigate the remaining allocation inefficiencies; see the sketch after the numbers below.

            -- Running runLinearRegCG on 10M_1k_sparse (all configs)
            LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
            LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 14
            LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 14
            -- Running runLinearRegCG on 10M_1k_sparse (all configs)
            LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
            LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
            LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 14
            -- Running runLinearRegCG on 10M_1k_sparse (all configs)
            LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
            LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 14
            LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 14
            -- Running runLinearRegCG on 10M_1k_sparse (all configs)
            LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 8
            LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 14
            LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 14
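
            A rough sketch of that row-partitioned, multi-threaded variant, using plain java.util.concurrent and reusing the hypothetical SparseRowLike/CbindSketch types from the sketch above (again illustrative, not SystemML's actual operator code):

            import java.util.ArrayList;
            import java.util.List;
            import java.util.concurrent.*;

            class ParallelCbindSketch {
                // process rows [rl, ru): one allocate-once cbind per output row
                static void cbindRowRange(SparseRowLike[] left, SparseRowLike[] right,
                        SparseRowLike[] out, int leftCols, int rl, int ru) {
                    for (int i = rl; i < ru; i++)
                        out[i] = CbindSketch.cbindRow(left[i], right[i], leftCols);
                }

                static void parallelCbind(SparseRowLike[] left, SparseRowLike[] right,
                        SparseRowLike[] out, int leftCols, int k)
                        throws InterruptedException, ExecutionException {
                    ExecutorService pool = Executors.newFixedThreadPool(k);
                    try {
                        int n = out.length;
                        int blk = (n + k - 1) / k; // ceil(n/k) rows per task
                        List<Future<?>> tasks = new ArrayList<>();
                        for (int rl = 0; rl < n; rl += blk) {
                            final int lo = rl, hi = Math.min(n, rl + blk);
                            tasks.add(pool.submit(() -> cbindRowRange(left, right, out, leftCols, lo, hi)));
                        }
                        for (Future<?> t : tasks)
                            t.get(); // wait for all tasks, propagating worker exceptions
                    } finally {
                        pool.shutdown();
                    }
                }
            }

            Partitioning by disjoint row ranges keeps the workers write-independent, so the output needs no synchronization.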
            

            People

              Assignee: mboehm7 Matthias Boehm
              Reporter: mboehm7 Matthias Boehm
              Votes: 0
              Watchers: 1
