Spark / SPARK-18701

Poisson GLM fails due to wrong initialization

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.2
    • Fix Version/s: 2.1.0
    • Component/s: ML
    • Labels: None
    • Important

    Description

      Fitting a Poisson GLM fails for many standard data sets. The cause is incorrect initialization, which leads to near-zero probabilities and weights. The following simple example reproduces the error.

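      // Assumed imports: the ml (not mllib) LabeledPoint and Vectors.
      import org.apache.spark.ml.feature.LabeledPoint
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.ml.regression.GeneralizedLinearRegression
      // Assumes a SparkSession named `spark` is in scope (as in spark-shell); needed for toDF().
      import spark.implicits._

      // Poisson responses in which most labels are zero.
      val datasetPoissonLogWithZero = Seq(
        LabeledPoint(0.0, Vectors.dense(18, 1.0)),
        LabeledPoint(1.0, Vectors.dense(12, 0.0)),
        LabeledPoint(0.0, Vectors.dense(15, 0.0)),
        LabeledPoint(0.0, Vectors.dense(13, 2.0)),
        LabeledPoint(0.0, Vectors.dense(15, 1.0)),
        LabeledPoint(1.0, Vectors.dense(16, 1.0)),
        LabeledPoint(0.0, Vectors.dense(10, 0.0)),
        LabeledPoint(0.0, Vectors.dense(15, 0.0)),
        LabeledPoint(0.0, Vectors.dense(12, 2.0)),
        LabeledPoint(0.0, Vectors.dense(13, 0.0)),
        LabeledPoint(1.0, Vectors.dense(15, 0.0)),
        LabeledPoint(1.0, Vectors.dense(15, 0.0)),
        LabeledPoint(0.0, Vectors.dense(15, 0.0)),
        LabeledPoint(0.0, Vectors.dense(12, 2.0)),
        LabeledPoint(1.0, Vectors.dense(12, 2.0))
      ).toDF()

      // Poisson family with log link and no regularization.
      val glr = new GeneralizedLinearRegression()
        .setFamily("poisson")
        .setLink("log")
        .setMaxIter(20)
        .setRegParam(0)

      // This fit is the call that fails.
      val model = glr.fit(datasetPoissonLogWithZero)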

      The issue is in the initialization: the mean is initialized to the response, which can be zero. Applying the log link then produces very negative numbers (protected against -Inf), which in turn leads to near-zero probabilities and weights in the weighted least squares. The fix is simple: add a small constant to the response, as in the `y + 0.1` line below.

      override def initialize(y: Double, weight: Double): Double = {
        require(y >= 0.0, "The response variable of Poisson family " +
          s"should be non-negative, but got $y")
        // Shift the initial mean away from zero so the log link stays well behaved.
        y + 0.1
      }
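
      To make the effect of the constant concrete, here is a minimal standalone sketch of the first weighted least squares step for a response of zero. It is not Spark's code: the object name, the helper names, and the `Epsilon` floor (standing in for the -Inf protection mentioned above) are assumptions for illustration. It relies on the standard result that, for the Poisson family with log link, the working weight equals the mean, since (dmu/deta)^2 / Var(mu) = mu^2 / mu = mu.

```scala
// Illustrative sketch only; names and the Epsilon value are hypothetical, not Spark's API.
object PoissonInitSketch {
  // Assumed floor that keeps log(mu) finite when mu is zero.
  val Epsilon: Double = 1e-16

  // Log link: eta = log(mu), with mu floored at Epsilon.
  def logLink(mu: Double): Double = math.log(math.max(mu, Epsilon))

  // Working weight for Poisson with log link: w = mu = exp(eta).
  def workingWeight(eta: Double): Double = math.exp(eta)

  def main(args: Array[String]): Unit = {
    val y = 0.0
    val etaOld = logLink(y)       // mean initialized as y
    val etaNew = logLink(y + 0.1) // mean initialized as y + 0.1
    println(f"mu = y:       eta = $etaOld%.3f, weight = ${workingWeight(etaOld)}%.3e")
    println(f"mu = y + 0.1: eta = $etaNew%.3f, weight = ${workingWeight(etaNew)}%.3e")
  }
}
```

      Under these assumptions, a zero response initialized as `y` gets a linear predictor near -36.8 and a weight near 1e-16, so it contributes almost nothing to the first least squares solve. Initialized as `y + 0.1`, the same observation gets a linear predictor near -2.3 and a weight of about 0.1.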

      I already have a fix and test code. Will create a PR.

          People

            Assignee: actuaryzhang Wayne Zhang
            Reporter: actuaryzhang Wayne Zhang

              Time Tracking

                Original Estimate: 1h
                Remaining Estimate: 1h
                Time Spent: Not Specified