  Spark / SPARK-32676

Fix double caching in KMeans/BiKMeans


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.1.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: ML
    • Labels: None

    Description

      On the .mllib side, the storage level of the input data is ignored: `runWithWeight` checks the storage level of the freshly mapped `instances` RDD, which is always `StorageLevel.NONE`, so `zippedData` is persisted even when the caller has already cached the input, and the data ends up cached twice:

      @Since("0.8.0")
      def run(data: RDD[Vector]): KMeansModel = {
        // `instances` is a fresh RDD produced by map(); its storage level
        // is NONE regardless of whether the caller persisted `data`.
        val instances = data.map(point => (point, 1.0))
        runWithWeight(instances, None)
      }

      private[spark] def runWithWeight(
          data: RDD[(Vector, Double)],
          instr: Option[Instrumentation]): KMeansModel = {

        // Compute squared norms and cache them.
        val norms = data.map { case (v, _) =>
          Vectors.norm(v, 2.0)
        }

        val zippedData = data.zip(norms).map { case ((v, w), norm) =>
          new VectorWithNorm(v, norm, w)
        }

        // `data` here is the mapped `instances` RDD, whose storage level is
        // always NONE -- so this branch is taken even when the caller's
        // input RDD is already cached, caching the data a second time.
        if (data.getStorageLevel == StorageLevel.NONE) {
          zippedData.persist(StorageLevel.MEMORY_AND_DISK)
        }
        val model = runAlgorithmWithWeight(zippedData, instr)
        zippedData.unpersist()

        model
      }
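      The failure mode can be modeled without Spark. A minimal toy sketch in Python (ToyRDD and its methods are hypothetical stand-ins, not Spark APIs): map() yields a fresh, uncached dataset, so a storage-level check on the mapped result never sees the caller's cache and persists a second copy.

      ```python
      # Toy model of the double-caching bug: a storage-level check on a
      # *derived* dataset always reports "uncached". ToyRDD is hypothetical.

      NONE = "NONE"
      MEMORY_AND_DISK = "MEMORY_AND_DISK"

      class ToyRDD:
          def __init__(self, data, storage_level=NONE):
              self.data = list(data)
              self.storage_level = storage_level

          def map(self, f):
              # A transformation returns a fresh dataset with no persistence,
              # independent of the parent's storage level.
              return ToyRDD(f(x) for x in self.data)

          def persist(self, level):
              self.storage_level = level

      def run_with_weight(instances):
          # Mirrors the check in runWithWeight: persist when the derived
          # dataset is uncached -- which it always is after map().
          if instances.storage_level == NONE:
              instances.persist(MEMORY_AND_DISK)
              return True  # persisted a second copy
          return False

      # The user caches the input, then the library maps and re-persists.
      user_input = ToyRDD([1.0, 2.0, 3.0])
      user_input.persist(MEMORY_AND_DISK)            # first copy in cache
      instances = user_input.map(lambda p: (p, 1.0)) # fresh, uncached RDD
      double_cached = run_with_weight(instances)     # second copy in cache
      print(double_cached)  # True: the data is now cached twice
      ```

      A fix along the lines adopted for this issue would decide whether to persist based on the original input's storage level (captured before the map) rather than on the derived RDD.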


          People

            Assignee: Ruifeng Zheng (podongfeng)
            Reporter: Ruifeng Zheng (podongfeng)
            Votes: 0
            Watchers: 2
