  Spark / SPARK-32676

Fix double caching in KMeans/BiKMeans


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.1.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: ML
    • Labels: None

    Description

      On the .mllib side, the storage level of the input data is ignored: `runWithWeight` checks the storage level of the freshly mapped `instances` RDD, which is always `StorageLevel.NONE`, so `zippedData` is persisted even when the caller has already cached the input, and the data ends up cached twice:

      @Since("0.8.0")
      def run(data: RDD[Vector]): KMeansModel = {
        // `instances` is a fresh RDD produced by map(); its storage level
        // is NONE regardless of whether the caller persisted `data`.
        val instances = data.map(point => (point, 1.0))
        runWithWeight(instances, None)
      }

      private[spark] def runWithWeight(
          data: RDD[(Vector, Double)],
          instr: Option[Instrumentation]): KMeansModel = {

        // Compute squared norms and cache them.
        val norms = data.map { case (v, _) =>
          Vectors.norm(v, 2.0)
        }

        val zippedData = data.zip(norms).map { case ((v, w), norm) =>
          new VectorWithNorm(v, norm, w)
        }

        // `data` here is the mapped `instances` RDD, whose storage level is
        // always NONE -- so this branch is taken even when the caller's
        // input RDD is already cached, caching the data a second time.
        if (data.getStorageLevel == StorageLevel.NONE) {
          zippedData.persist(StorageLevel.MEMORY_AND_DISK)
        }
        val model = runAlgorithmWithWeight(zippedData, instr)
        zippedData.unpersist()

        model
      }
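      The failure mode can be modeled without Spark. A minimal toy sketch in Python (ToyRDD and its methods are hypothetical stand-ins, not Spark APIs): map() yields a fresh, uncached dataset, so a storage-level check on the mapped result never sees the caller's cache and persists a second copy.

      ```python
      # Toy model of the double-caching bug: a storage-level check on a
      # *derived* dataset always reports "uncached". ToyRDD is hypothetical.

      NONE = "NONE"
      MEMORY_AND_DISK = "MEMORY_AND_DISK"

      class ToyRDD:
          def __init__(self, data, storage_level=NONE):
              self.data = list(data)
              self.storage_level = storage_level

          def map(self, f):
              # A transformation returns a fresh dataset with no persistence,
              # independent of the parent's storage level.
              return ToyRDD(f(x) for x in self.data)

          def persist(self, level):
              self.storage_level = level

      def run_with_weight(instances):
          # Mirrors the check in runWithWeight: persist when the derived
          # dataset is uncached -- which it always is after map().
          if instances.storage_level == NONE:
              instances.persist(MEMORY_AND_DISK)
              return True  # persisted a second copy
          return False

      # The user caches the input, then the library maps and re-persists.
      user_input = ToyRDD([1.0, 2.0, 3.0])
      user_input.persist(MEMORY_AND_DISK)            # first copy in cache
      instances = user_input.map(lambda p: (p, 1.0)) # fresh, uncached RDD
      double_cached = run_with_weight(instances)     # second copy in cache
      print(double_cached)  # True: the data is now cached twice
      ```

      A fix along the lines adopted for this issue would decide whether to persist based on the original input's storage level (captured before the map) rather than on the derived RDD.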


          People

            Assignee: Ruifeng Zheng (podongfeng)
            Reporter: Ruifeng Zheng (podongfeng)
            Votes: 0
            Watchers: 2
