Description
SPARK-23381 added a corrected MurmurHash3 implementation but left the old implementation in place. In Spark 2.3 and earlier, HashingTF uses the old implementation. We should not backport a fix for HashingTF, since it would be a major change of behavior, but we should correct HashingTF in Spark 2.4. This JIRA tracks that fix:
- Update HashingTF to use the new implementation of MurmurHash3
- Ensure backwards compatibility for ML persistence by having HashingTF use the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded. We can add a Param to allow this.
Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I recommend we first migrate the code to spark.ml (SPARK-21748). We can then leave spark.mllib alone and fix MurmurHash3 only in spark.ml.
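To make the compatibility concern concrete, here is a minimal Python sketch (not Spark code) of how the two hash variants could be selected when a model is loaded. It rests on several assumptions: that the old routine differs from standard Murmur3 only in how it handles trailing bytes when the input length is not a multiple of 4 (my reading of SPARK-23381), that HashingTF hashes the UTF-8 bytes of a term with seed 42, and that the index is a non-negative modulo of the signed 32-bit hash. The names `murmur3_legacy`, `uses_legacy_hash`, and `term_index` are hypothetical and do not correspond to any Spark API.

```python
MASK = 0xFFFFFFFF
C1, C2 = 0xCC9E2D51, 0x1B873593

def _rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK

def _mix_k1(k1):
    k1 = (k1 * C1) & MASK
    k1 = _rotl(k1, 15)
    return (k1 * C2) & MASK

def _mix_h1(h1, k1):
    h1 = _rotl(h1 ^ k1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK

def _fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK
    h1 ^= h1 >> 16
    return h1

def _hash_blocks(data, seed):
    # Mix all complete 4-byte little-endian blocks; return state and leftover bytes.
    h1 = seed & MASK
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        k1 = int.from_bytes(data[i:i + 4], "little")
        h1 = _mix_h1(h1, _mix_k1(k1))
    return h1, data[aligned:]

def murmur3_standard(data, seed=0):
    # Standard Murmur3 x86_32: the 1-3 tail bytes are packed into one word
    # and XORed into the state without the block-level rotate/multiply.
    h1, tail = _hash_blocks(data, seed)
    if tail:
        h1 ^= _mix_k1(int.from_bytes(tail, "little"))
    return _fmix(h1, len(data))

def murmur3_legacy(data, seed=0):
    # ASSUMPTION: the old behavior treats every tail byte (sign-extended)
    # as if it were a full block of its own.
    h1, tail = _hash_blocks(data, seed)
    for b in tail:
        signed = b - 256 if b > 127 else b
        h1 = _mix_h1(h1, _mix_k1(signed & MASK))
    return _fmix(h1, len(data))

def uses_legacy_hash(saved_spark_version):
    # Hypothetical rule: models written by Spark 2.3 or earlier keep the old hash.
    major, minor = (int(p) for p in saved_spark_version.split(".")[:2])
    return (major, minor) <= (2, 3)

def term_index(term, num_features, legacy):
    # ASSUMPTION: seed 42 and a non-negative modulo of the signed 32-bit hash.
    hash_fn = murmur3_legacy if legacy else murmur3_standard
    h = hash_fn(term.encode("utf-8"), seed=42)
    if h >= 1 << 31:
        h -= 1 << 32
    return h % num_features
```

Under these assumptions the two variants agree whenever the encoded term length is a multiple of 4 and generally disagree otherwise, which is why a model saved by an older version would need to keep the old hash to reproduce the feature indices it was trained with. For example, `term_index("spark", 1 << 18, uses_legacy_hash("2.3.0"))` would use the legacy path.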
Issue Links
- is blocked by: SPARK-21748 Migrate the implementation of HashingTF from MLlib to ML (Resolved)
- is related to: SPARK-23381 Murmur3 hash generates a different value from other implementations (Resolved)