SystemDS / SYSTEMDS-1242

Perftest: OutOfMemoryError in MultiLogReg for 80g sparse

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: SystemML 0.11, SystemML 0.12
    • Fix Version/s: SystemML 0.13
    • Component/s: Test
    • Labels: None
    • Environment: spark 2.1.0

    Description

      When running the runMultiLogReg.sh script, MultiLogReg.dml ends with an OutOfMemoryError for the case of 10M_1K sparse data and icpt = 1. Here is the end of the log file:

      17/02/04 17:20:33 INFO api.DMLScript: SystemML Statistics:
      Total elapsed time:		697.694 sec.
      Total compilation time:		2.543 sec.
      Total execution time:		695.151 sec.
      Number of compiled Spark inst:	73.
      Number of executed Spark inst:	16.
      Cache hits (Mem, WB, FS, HDFS):	46/9/1/7.
      Cache writes (WB, FS, HDFS):	27/1/1.
      Cache times (ACQr/m, RLS, EXP):	281.541/0.003/131.589/48.737 sec.
      HOP DAGs recompiled (PRED, SB):	0/15.
      HOP DAGs recompile time:	0.067 sec.
      Spark ctx create time (lazy):	31.078 sec.
      Spark trans counts (par,bc,col):5/4/0.
      Spark trans times (par,bc,col):	46.748/0.392/0.000 secs.
      Total JIT compile time:		151.254 sec.
      Total JVM GC count:		144.
      Total JVM GC time:		220.671 sec.
      Heavy hitter instructions (name, time, count):
      -- 1) 	ba+* 	144.194 sec 	3
      -- 2) 	rand 	109.939 sec 	9
      -- 3) 	uark+ 	105.011 sec 	2
      -- 4) 	r' 	100.933 sec 	3
      -- 5) 	sp_/ 	80.387 sec 	1
      -- 6) 	sp_mapmm 	45.491 sec 	2
      -- 7) 	sp_tak+* 	40.655 sec 	1
      -- 8) 	append 	9.480 sec 	1
      -- 9) 	rangeReIndex 	7.347 sec 	2
      -- 10) 	sp_- 	6.392 sec 	3
      
      17/02/04 17:20:33 INFO api.DMLScript: END DML run 02/04/2017 17:20:33
      Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
      	at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:363)
      	at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:339)
      	at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlockUnsafe(MatrixBlock.java:408)
      	at org.apache.sysml.runtime.io.MatrixReader.createOutputMatrixBlock(MatrixReader.java:107)
      	at org.apache.sysml.runtime.io.ReaderBinaryBlockParallel.readMatrixFromHDFS(ReaderBinaryBlockParallel.java:59)
      	at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:203)
      	at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:168)
      	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:425)
      	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:60)
      	at org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:920)
      	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromRDD(MatrixObject.java:478)
      	at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromRDD(MatrixObject.java:60)
      	at org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireRead(CacheableData.java:411)
      	at org.apache.sysml.runtime.controlprogram.context.ExecutionContext.getMatrixInput(ExecutionContext.java:209)
      	at org.apache.sysml.runtime.instructions.cp.AggregateBinaryCPInstruction.processInstruction(AggregateBinaryCPInstruction.java:74)
      	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:290)
      	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:221)
      	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:168)
      	at org.apache.sysml.runtime.controlprogram.IfProgramBlock.execute(IfProgramBlock.java:139)
      	at org.apache.sysml.runtime.controlprogram.WhileProgramBlock.execute(WhileProgramBlock.java:165)
      	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
      	at org.apache.sysml.api.DMLScript.execute(DMLScript.java:684)
      	at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:360)
      	at org.apache.sysml.api.DMLScript.main(DMLScript.java:221)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
      17/02/04 17:20:33 INFO util.ShutdownHookManager: Shutdown hook called
      17/02/04 17:20:33 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-c8185e69-2eaa-4719-ab42-f6af0edcbbeb
      
      

      Attachments

        1. sparkDML.sh
          5 kB
          Imran Younus


          Activity

            mboehm7 Matthias Boehm added a comment -

            ok, this is interesting, as the OOM does not come from the first read of our sparse input matrix but from some intermediate. I don't have a cluster environment to reproduce this right now but will look into it in the next few days.

            iyounus Imran Younus added a comment - - edited

            Attached the sparkDML.sh script with cluster settings used to run this test.

            acs_s Arvind Surve added a comment -

            iyounus has reserved Curly for his performance testing. For now he is verifying results and waiting for the 1242 & 1243 fixes.
            In the meantime, mboehm7, you should be able to use Curly to recreate this issue.

            mboehm7 Matthias Boehm added a comment -

            I assume this is referring to the multinomial use case, right? iyounus please correct me if I'm wrong.

            iyounus Imran Younus added a comment -

            I'm not sure what you mean by "use cases". If you're referring to the input parameter icpt, then yes. The performance test runs with three different values of icpt. For icpt=0, the test runs fine. But for icpt=1, I see the OOM error reported in this JIRA. The test with icpt=2 did not run.

            mboehm7 Matthias Boehm added a comment -

            well our performance test suite runs MultiLogReg for two types of classification use cases: binomial (2 classes) and multinomial (150 classes).

            mboehm7 Matthias Boehm added a comment -

            Meanwhile, I was able to reproduce this issue. After closer investigation, it occurs at HOP 925 in the recompilation output below:

            GENERIC (lines 325-327):
            --(912) TRead B_new [1001,149,1000,1000,149149] [0,0,1 -> 1MB], CP
            --(913) TWrite B (912) [1001,149,1000,1000,149149] [1,0,0 -> 1MB], CP
            --(911) TRead P_new [10000000,150,1000,1000,-1] [0,0,11444 -> 11444MB], CP
            --(914) TWrite P (911) [10000000,150,1000,1000,-1] [11444,0,0 -> 11444MB], CP
            --(909) TRead X [10000000,1001,1000,1000,109994979] [0,0,2441 -> 2441MB], CP
            --(915) r(t) (909) [1001,10000000,1000,1000,109994979] [2441,0,1259 -> 3700MB], CP
            --(919) rix (911) [10000000,149,1000,1000,-1] [11444,0,11444 -> 22888MB], SPARK
            --(910) TRead Y [10000000,150,1000,1000,10000000] [0,0,1640 -> 1640MB], CP
            --(923) rix (910) [10000000,149,1000,1000,-1] [1640,0,1640 -> 3281MB], CP
            --(924) b(-) (919,923) [10000000,149,1000,1000,-1] [13084,0,11368 -> 24452MB], SPARK
            --(925) ba(+*) (915,924) [1001,149,1000,1000,-1] [12627,1,1 -> 12629MB], CP
            --(926) TWrite Grad (925) [1001,149,1000,1000,-1] [1,0,0 -> 1MB], CP
            

            We have a memory budget of 14157MB and, because the right input to this ba+* (of size 11368MB) does not fit twice in local memory, we correctly go through a guarded collect (i.e., pipe the RDD to HDFS and subsequently try to read it in). However, while the first iteration goes through, the second iteration runs out of memory due to additional memory consumers such as previously parallelized RDDs, broadcasts, and a full buffer pool.
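
            For illustration, here is a minimal, self-contained sketch of that guarded-collect decision (hypothetical names and a simplified condition, not the actual runtime code), plugging in the sizes from this run:

            public class GuardedCollectSketch {

                /** true if the RDD output can safely be collected directly into local memory */
                static boolean fitsForDirectCollect(long estSizeBytes, long memBudgetBytes) {
                    // the estimated output must fit twice into the local memory budget (see above)
                    return 2 * estSizeBytes <= memBudgetBytes;
                }

                public static void main(String[] args) {
                    long mb = 1024L * 1024L;
                    long memBudget = 14157 * mb; // local memory budget reported above
                    long rddSize   = 11368 * mb; // right input of the ba+* at HOP 925

                    if (fitsForDirectCollect(rddSize, memBudget))
                        System.out.println("direct collect into a local MatrixBlock");
                    else
                        System.out.println("guarded collect: write the RDD to HDFS, then read it back");
                }
            }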

            There are multiple actions we can and should take here (created as subtasks):

            1) Revisit the existing 'transitive execution type selection' conditions to reduce memory pressure. In the above example, we should actually pull the matrix multiplication transitively into SPARK - this would not just reduce memory pressure but also reduce the size of transferred intermediates. So far we do this only for the left input of matrix multiplications (a rough sketch follows after this list).

            2) Keep track of referenced parallelized RDDs and broadcasts with appropriate fallbacks once they do not fit together into local memory. This is similar to keeping track of pinned variables as already done in CacheableData.

            3) Approach for handling RDD collects with unknown sparsity: While debugging this use case, I also came across a couple of instances where we unnecessarily collect sparse matrix RDDs into dense matrix blocks because the nnz are unknown as the RDD has not yet been computed. However, in many cases the compiler already knows a worst-case sparsity estimate - we could propagate these in a best-effort manner through Spark instructions (also sketched below).
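
            Regarding item 1, a rough sketch of the intended decision (hypothetical types and names, not the actual compiler code): if an input of a CP matrix multiplication is itself a SPARK intermediate that does not fit twice into the local memory budget, pull the multiplication transitively into SPARK instead of forcing an OOM-prone collect.

            public class TransitiveExecTypeSketch {

                enum ExecType { CP, SPARK }

                /** producing exec type and estimated output size of one matrix-multiply input */
                static final class Input {
                    final ExecType execType;
                    final long estSizeBytes;
                    Input(ExecType execType, long estSizeBytes) {
                        this.execType = execType;
                        this.estSizeBytes = estSizeBytes;
                    }
                }

                static ExecType chooseMatMultExecType(Input left, Input right, long memBudget) {
                    // so far only the left input is considered; the proposal extends this to the right input
                    boolean pullLeft  = left.execType  == ExecType.SPARK && 2 * left.estSizeBytes  > memBudget;
                    boolean pullRight = right.execType == ExecType.SPARK && 2 * right.estSizeBytes > memBudget;
                    return (pullLeft || pullRight) ? ExecType.SPARK : ExecType.CP;
                }

                public static void main(String[] args) {
                    long mb = 1024L * 1024L;
                    // output sizes from the recompilation output: r(t)(X) is a CP intermediate, b(-) a SPARK one
                    Input tX   = new Input(ExecType.CP,    1259 * mb);
                    Input diff = new Input(ExecType.SPARK, 11368 * mb);
                    System.out.println(chooseMatMultExecType(tX, diff, 14157 * mb)); // prints SPARK
                }
            }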
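
            And regarding item 3, a small sketch of the idea (hypothetical names and threshold, not SystemML's actual reader code): given a worst-case nnz estimate propagated from the compiler, a collect with unknown exact nnz could still allocate a sparse target block instead of defaulting to dense (note that the failing call in the stack trace above is such a dense allocation).

            public class SparsityAwareCollectSketch {

                /** decide the target block representation from a worst-case nnz estimate */
                static boolean allocateSparse(long rows, long cols, long worstCaseNnz) {
                    if (worstCaseNnz < 0)
                        return false; // nnz completely unknown: today's behavior, dense worst case
                    double sparsity = (double) worstCaseNnz / (rows * cols);
                    return sparsity < 0.4; // hypothetical sparse/dense switching threshold
                }

                public static void main(String[] args) {
                    // dimensions and nnz of the sparse input X from the recompilation output above
                    long rows = 10_000_000L, cols = 1001L, nnz = 109_994_979L;
                    System.out.println(allocateSparse(rows, cols, nnz)
                        ? "allocate sparse target block"
                        : "allocate dense target block");
                }
            }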

            mboehm7 Matthias Boehm added a comment -

            ok, with the changes in SYSTEMML-1261 the script now runs through, but we should implement the other two items as well.

            mboehm7 Matthias Boehm added a comment -

            gweidner do you still have the logs from the runs for our 0.12 release? Looking over the results from SYSTEMML-1217 it appears that the tests for intercept 1 and 2 failed there too.

            gweidner Glenn Weidner added a comment -

            Yes mboehm7 - I do still have the full logs for runMultiLogReg_10M_1k_sparse_k150 for both 0.12 and 0.11 used for comparison. I've attached the logs for MultiLogReg sparse for both 0.12 and 0.11 on the same cluster to SYSTEMML-1217. The runMultiLogReg.sh that was used for the logs was also added to SYSTEMML-1217. Additional logs for other cases are available if needed.

            mboehm7 Matthias Boehm added a comment -

            Thanks gweidner - that explains it. So the same issue also affects 0.11 and 0.12 (and probably earlier releases too, as we usually tested against something like 20 classes).

            iyounus Imran Younus added a comment -

            mboehm7 I ran this test again after your fix, and it completed successfully. Should we close this JIRA now?

            mboehm7 Matthias Boehm added a comment -

            sounds good - thanks for confirming iyounus

            mboehm7 Matthias Boehm added a comment -

            I'm closing this issue as the new transitive execution type selection fixed the specific OOM - the other robustness features will be addressed in separate JIRAs.


            People

              Assignee: mboehm7 Matthias Boehm
              Reporter: iyounus Imran Younus
              Votes: 0
              Watchers: 4
