Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.2.1
- Fix Version/s: None
- Environment: PySpark 2.2.1 / Windows 10
Description
Making predictions from a RandomForestClassifier model in PySpark is much faster than making predictions from an individual tree contained in the model's .trees attribute.
In fact, the transform call on an individual tree, even without triggering an action, is more than 10x slower than the same model.transform call on the random forest model.
See https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark for an example with timings.
Ideally:
- Getting a prediction from a single tree should be comparable to or faster than getting predictions from the whole forest
- Getting all the predictions from all the individual trees should be comparable in speed to getting the predictions from the random forest
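A minimal sketch of how the comparison described above could be measured. The helper names (timed, compare_transform_times, fit_and_compare), the "features" and "label" column names, and numTrees=10 are assumptions made for illustration and are not taken from the linked example. Only the transform call is timed and no action is triggered, which matches the scenario in the description.

```python
import time


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare_transform_times(model, df):
    """Time transform() on the forest and on each tree in model.trees.

    No action is triggered, so only the transform() call itself is timed.
    """
    _, forest_seconds = timed(lambda: model.transform(df))
    tree_seconds = []
    for tree in model.trees:
        # Bind the current tree as a default argument to avoid late binding.
        _, seconds = timed(lambda t=tree: t.transform(df))
        tree_seconds.append(seconds)
    return forest_seconds, tree_seconds


def fit_and_compare(train_df, test_df):
    """Hypothetical driver: assumes 'features' and 'label' columns exist."""
    # Imported lazily so the timing helpers above work without PySpark.
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(
        featuresCol="features", labelCol="label", numTrees=10
    )
    model = rf.fit(train_df)
    forest_seconds, tree_seconds = compare_transform_times(model, test_df)
    print("forest transform: %.4f s" % forest_seconds)
    print("slowest single-tree transform: %.4f s" % max(tree_seconds))
    return forest_seconds, tree_seconds
```

Calling fit_and_compare(train_df, test_df) with two DataFrames that already contain an assembled feature vector column would print both timings, so the forest and per-tree numbers can be compared side by side.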
Issue Links
- Is contained by: SPARK-14046 RandomForest improvement umbrella (Resolved)