SystemDS / SYSTEMDS-1242

Perftest: OutOfMemoryError in MultiLogReg for 80g sparse

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: SystemML 0.11, SystemML 0.12
    • Fix Version/s: SystemML 0.13
    • Component/s: Test
    • Labels: None
    • Environment: spark 2.1.0

    Description

      When running the runMultiLogReg.sh script, MultiLogReg.dml ends with an OutOfMemoryError for the case of 10M_1K sparse data and icpt = 1. Here is the end of the log file:

      17/02/04 17:20:33 INFO api.DMLScript: SystemML Statistics:
      Total elapsed time:		697.694 sec.
      Total compilation time:		2.543 sec.
      Total execution time:		695.151 sec.
      Number of compiled Spark inst:	73.
      Number of executed Spark inst:	16.
      Cache hits (Mem, WB, FS, HDFS):	46/9/1/7.
      Cache writes (WB, FS, HDFS):	27/1/1.
      Cache times (ACQr/m, RLS, EXP):	281.541/0.003/131.589/48.737 sec.
      HOP DAGs recompiled (PRED, SB):	0/15.
      HOP DAGs recompile time:	0.067 sec.
      Spark ctx create time (lazy):	31.078 sec.
      Spark trans counts (par,bc,col):5/4/0.
      Spark trans times (par,bc,col):	46.748/0.392/0.000 secs.
      Total JIT compile time:		151.254 sec.
      Total JVM GC count:		144.
      Total JVM GC time:		220.671 sec.
      Heavy hitter instructions (name, time, count):
      -- 1) 	ba+* 	144.194 sec 	3
      -- 2) 	rand 	109.939 sec 	9
      -- 3) 	uark+ 	105.011 sec 	2
      -- 4) 	r' 	100.933 sec 	3
      -- 5) 	sp_/ 	80.387 sec 	1
      -- 6) 	sp_mapmm 	45.491 sec 	2
      -- 7) 	sp_tak+* 	40.655 sec 	1
      -- 8) 	append 	9.480 sec 	1
      -- 9) 	rangeReIndex 	7.347 sec 	2
      -- 10) 	sp_- 	6.392 sec 	3
      
      17/02/04 17:20:33 INFO api.DMLScript: END DML run 02/04/2017 17:20:33
      Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
      	at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:363)
      	at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:339)
      	at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlockUnsafe(MatrixBlock.java:408)
      	at org.apache.sysml.runtime.io.MatrixReader.createOutputMatrixBlock(MatrixReader.java:107)
      	at org.apache.sysml.runtime.io.ReaderBinaryBlockParallel.readMatrixFromHDFS(ReaderBinaryBlockParallel.java:59)
      	at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:203)
      	at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:168)
      	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:425)
      	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:60)
      	at org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:920)
      	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromRDD(MatrixObject.java:478)
      	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromRDD(MatrixObject.java:60)
      	at org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireRead(CacheableData.java:411)
      	at org.apache.sysml.runtime.controlprogram.context.ExecutionContext.getMatrixInput(ExecutionContext.java:209)
      	at org.apache.sysml.runtime.instructions.cp.AggregateBinaryCPInstruction.processInstruction(AggregateBinaryCPInstruction.java:74)
      	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:290)
      	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:221)
      	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:168)
      	at org.apache.sysml.runtime.controlprogram.IfProgramBlock.execute(IfProgramBlock.java:139)
      	at org.apache.sysml.runtime.controlprogram.WhileProgramBlock.execute(WhileProgramBlock.java:165)
      	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
      	at org.apache.sysml.api.DMLScript.execute(DMLScript.java:684)
      	at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:360)
      	at org.apache.sysml.api.DMLScript.main(DMLScript.java:221)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
      17/02/04 17:20:33 INFO util.ShutdownHookManager: Shutdown hook called
      17/02/04 17:20:33 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-c8185e69-2eaa-4719-ab42-f6af0edcbbeb
      
      

      Attachments

        1. sparkDML.sh
          5 kB
          Imran Younus


          Activity

            mboehm7 Matthias Boehm added a comment -

            ok, this is interesting, as the OOM does not come from the first read of our sparse input matrix but from some intermediate. I don't have a cluster environment to reproduce this right now but will look into it in the next few days.

            iyounus Imran Younus added a comment - - edited

            Attached the sparkDML.sh script with cluster settings used to run this test.

            acs_s Arvind Surve added a comment -

            iyounus has reserved Curly for his performance testing. For now he is verifying results and waiting for the 1242 & 1243 fixes.
            In the meantime, mboehm7, you should be able to use Curly to recreate this issue.

            mboehm7 Matthias Boehm added a comment -

            I assume this is referring to the multinomial use case, right? iyounus please correct me if I'm wrong.

            iyounus Imran Younus added a comment -

            I'm not sure what you mean by "use cases". If you're referring to the input parameter icpt, then yes. The performance test runs with three different values of icpt. For icpt=0, the test runs fine. But for icpt=1, I see the OOM error reported in this JIRA. The test with icpt=2 did not run.

            mboehm7 Matthias Boehm added a comment -

            well our performance test suite runs MultiLogReg for two types of classification use cases: binomial (2 classes) and multinomial (150 classes).

            mboehm7 Matthias Boehm added a comment -

            Meanwhile, I was able to reproduce this issue. After closer investigation, it occurs at HOP 925 in the recompilation output below:

            GENERIC (lines 325-327):
            --(912) TRead B_new [1001,149,1000,1000,149149] [0,0,1 -> 1MB], CP
            --(913) TWrite B (912) [1001,149,1000,1000,149149] [1,0,0 -> 1MB], CP
            --(911) TRead P_new [10000000,150,1000,1000,-1] [0,0,11444 -> 11444MB], CP
            --(914) TWrite P (911) [10000000,150,1000,1000,-1] [11444,0,0 -> 11444MB], CP
            --(909) TRead X [10000000,1001,1000,1000,109994979] [0,0,2441 -> 2441MB], CP
            --(915) r(t) (909) [1001,10000000,1000,1000,109994979] [2441,0,1259 -> 3700MB], CP
            --(919) rix (911) [10000000,149,1000,1000,-1] [11444,0,11444 -> 22888MB], SPARK
            --(910) TRead Y [10000000,150,1000,1000,10000000] [0,0,1640 -> 1640MB], CP
            --(923) rix (910) [10000000,149,1000,1000,-1] [1640,0,1640 -> 3281MB], CP
            --(924) b(-) (919,923) [10000000,149,1000,1000,-1] [13084,0,11368 -> 24452MB], SPARK
            --(925) ba(+*) (915,924) [1001,149,1000,1000,-1] [12627,1,1 -> 12629MB], CP
            --(926) TWrite Grad (925) [1001,149,1000,1000,-1] [1,0,0 -> 1MB], CP
            

            We have a memory budget of 14157MB and, because the right input to this ba+* (of size 11368MB) does not fit twice in local memory, we correctly go through a guarded collect (i.e., pipe the RDD to HDFS and subsequently try to read it in). However, while the first iteration goes through, the second iteration runs out of memory due to additional memory consumers such as previously parallelized RDDs, broadcasts, and a full buffer pool.
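
            For illustration, here is a minimal, self-contained sketch of that guarded-collect decision (hypothetical names and a simplified condition, not the actual runtime code), plugging in the sizes from this run:

            public class GuardedCollectSketch {

                /** true if the RDD output can safely be collected directly into local memory */
                static boolean fitsForDirectCollect(long estSizeBytes, long memBudgetBytes) {
                    // the estimated output must fit twice into the local memory budget (see above)
                    return 2 * estSizeBytes <= memBudgetBytes;
                }

                public static void main(String[] args) {
                    long mb = 1024L * 1024L;
                    long memBudget = 14157 * mb; // local memory budget reported above
                    long rddSize   = 11368 * mb; // right input of the ba+* at HOP 925

                    if (fitsForDirectCollect(rddSize, memBudget))
                        System.out.println("direct collect into a local MatrixBlock");
                    else
                        System.out.println("guarded collect: write the RDD to HDFS, then read it back");
                }
            }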

            There are multiple actions we can and should take here (created as subtasks):

            1) Revisit the existing 'transitive execution type selection' conditions to reduce memory pressure. In the above example, we should actually pull the matrix multiplication transitively into SPARK - this would not just reduce memory pressure but also reduce the size of transferred intermediates. So far we do this only for the left input of matrix multiplications (a rough sketch follows after this list).

            2) Keep track of referenced parallelized RDDs and broadcasts with appropriate fallbacks once they do not fit together into local memory. This is similar to keeping track of pinned variables as already done in CacheableData.

            3) Approach for handling RDD collects with unknown sparsity: While debugging this use case, I also came across a couple of instances where we unnecessarily collect sparse matrix RDDs into dense matrix blocks because the nnz are unknown as the RDD has not yet been computed. However, in many cases the compiler already knows a worst-case sparsity estimate - we could propagate these in a best-effort manner through Spark instructions (also sketched below).
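
            Regarding item 1, a rough sketch of the intended decision (hypothetical types and names, not the actual compiler code): if an input of a CP matrix multiplication is itself a SPARK intermediate that does not fit twice into the local memory budget, pull the multiplication transitively into SPARK instead of forcing an OOM-prone collect.

            public class TransitiveExecTypeSketch {

                enum ExecType { CP, SPARK }

                /** producing exec type and estimated output size of one matrix-multiply input */
                static final class Input {
                    final ExecType execType;
                    final long estSizeBytes;
                    Input(ExecType execType, long estSizeBytes) {
                        this.execType = execType;
                        this.estSizeBytes = estSizeBytes;
                    }
                }

                static ExecType chooseMatMultExecType(Input left, Input right, long memBudget) {
                    // so far only the left input is considered; the proposal extends this to the right input
                    boolean pullLeft  = left.execType  == ExecType.SPARK && 2 * left.estSizeBytes  > memBudget;
                    boolean pullRight = right.execType == ExecType.SPARK && 2 * right.estSizeBytes > memBudget;
                    return (pullLeft || pullRight) ? ExecType.SPARK : ExecType.CP;
                }

                public static void main(String[] args) {
                    long mb = 1024L * 1024L;
                    // output sizes from the recompilation output: r(t)(X) is a CP intermediate, b(-) a SPARK one
                    Input tX   = new Input(ExecType.CP,    1259 * mb);
                    Input diff = new Input(ExecType.SPARK, 11368 * mb);
                    System.out.println(chooseMatMultExecType(tX, diff, 14157 * mb)); // prints SPARK
                }
            }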
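
            And regarding item 3, a small sketch of the idea (hypothetical names and threshold, not SystemML's actual reader code): given a worst-case nnz estimate propagated from the compiler, a collect with unknown exact nnz could still allocate a sparse target block instead of defaulting to dense (note that the failing call in the stack trace above is such a dense allocation).

            public class SparsityAwareCollectSketch {

                /** decide the target block representation from a worst-case nnz estimate */
                static boolean allocateSparse(long rows, long cols, long worstCaseNnz) {
                    if (worstCaseNnz < 0)
                        return false; // nnz completely unknown: today's behavior, dense worst case
                    double sparsity = (double) worstCaseNnz / (rows * cols);
                    return sparsity < 0.4; // hypothetical sparse/dense switching threshold
                }

                public static void main(String[] args) {
                    // dimensions and nnz of the sparse input X from the recompilation output above
                    long rows = 10_000_000L, cols = 1001L, nnz = 109_994_979L;
                    System.out.println(allocateSparse(rows, cols, nnz)
                        ? "allocate sparse target block"
                        : "allocate dense target block");
                }
            }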

            mboehm7 Matthias Boehm added a comment -

            ok, with the changes in SYSTEMML-1261 the script now runs through, but we should implement the other two items as well.

            mboehm7 Matthias Boehm added a comment -

            gweidner do you still have the logs from the runs for our 0.12 release? Looking over the results from SYSTEMML-1217 it appears that the tests for intercept 1 and 2 failed there too.

            gweidner Glenn Weidner added a comment -

            Yes mboehm7 - I do still have the full logs for runMultiLogReg_10M_1k_sparse_k150 for both 0.12 and 0.11 used for comparison. I've attached the logs for MultiLogReg sparse for both 0.12 and 0.11 on the same cluster to SYSTEMML-1217. The runMultiLogReg.sh that was used for the logs was also added to SYSTEMML-1217. Additional logs for other cases are available if needed.

            mboehm7 Matthias Boehm added a comment -

            Thanks gweidner - that explains it. So the same issue also affects 0.11 and 0.12 (and probably earlier releases too, as we usually tested against something like 20 classes).

            iyounus Imran Younus added a comment -

            mboehm7 I ran this test again after your fix, and it completed successfully. Should we close this JIRA now?

            mboehm7 Matthias Boehm added a comment -

            sounds good - thanks for confirming iyounus

            mboehm7 Matthias Boehm added a comment -

            I'm closing this issue as the new transitive execution type selection fixed the specific OOM - the other robustness features will be addressed in separate JIRAs.


            People

              Assignee: mboehm7 Matthias Boehm
              Reporter: iyounus Imran Younus
              Votes: 0
              Watchers: 4
