Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.6.0
Fix Version/s: None
Environment: Linux
Description
The RandomForest implementation can easily run out of memory on moderate datasets. This was raised in a user's benchmarking project on GitHub (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was a tracking issue, but I couldn't find one.
Using Spark 1.6, a user of mine is running into problems training RandomForest on largish datasets, on machines with 64 GB of memory and the following in spark-defaults.conf:
spark.executor.cores     2
spark.executor.instances 199
spark.executor.memory    10240M
I reproduced the excessive memory use from the benchmark example (using an input CSV of 1.3 GB with 686 columns) in the Spark shell, started with spark-shell --driver-memory 30G --executor-memory 30G, and captured heap profiles on a single machine by running jmap -histo:live <spark-pid> every 5 seconds. At the peak the histogram looks like this:
 num     #instances         #bytes  class name
----------------------------------------------
   1:       5428073     8458773496  [D
   2:      12293653     4124641992  [I
   3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
   4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
   5:      72853787     1165660592  scala.Some
   6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
   7:         72969      390492744  [B
   8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
   9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
  10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
  11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
  12:       3764745       60235920  java.lang.Integer
  13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
  14:        380804       45361144  [C
  15:        268887       34877128  <constMethodKlass>
  16:        268887       34431568  <methodKlass>
  17:        908377       34042760  [Lscala.collection.immutable.HashMap;
  18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
  19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
  20:         20206       25979864  <constantPoolKlass>
  21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
  22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
  23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
  24:         20206       20158864  <instanceKlassKlass>
  25:         17023       14380352  <constantPoolCacheKlass>
  26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
  27:        445797       10699128  scala.Tuple2
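For context, here is a minimal sketch of the kind of spark-shell session that reproduces this. The input path, label column, and hyperparameters below are assumptions for illustration, not the exact benchmark values (those are in the linked benchm-ml repo):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Parse the CSV into LabeledPoints, assuming all-numeric features with the
// label in the last column. This sketch builds dense vectors for simplicity;
// the histogram above shows the actual run ended up with SparseVector instances.
val data = sc.textFile("/path/to/train.csv").map { line =>
  val fields = line.split(',').map(_.toDouble)
  LabeledPoint(fields.last, Vectors.dense(fields.init))
}.cache()

// Many deep trees are what multiply the Node/Predict/InformationGainStats
// objects that dominate the histogram. numTrees and maxDepth are assumptions.
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 500,
  featureSubsetStrategy = "sqrt",
  impurity = "entropy",   // consistent with the EntropyAggregator entries above
  maxDepth = 20,
  maxBins = 32)

Note that the bulk of the live heap here is Node/Predict/Some/InformationGainStats, i.e. the in-memory tree model itself, which grows with numTrees and, in the worst case, exponentially with maxDepth.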
Issue Links
- is contained by SPARK-14046 RandomForest improvement umbrella (Resolved)
- is related to SPARK-3728 RandomForest: Learn models too large to store in memory (Resolved)