Description
Poisson GLM fails on many standard data sets. The cause is an incorrect initialization that leads to near-zero probabilities and weights. The following minimal example reproduces the error.
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import spark.implicits._  // for toDF(); spark is the active SparkSession, e.g. in spark-shell

val datasetPoissonLogWithZero = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(1.0, Vectors.dense(12, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(1.0, Vectors.dense(16, 1.0)),
  LabeledPoint(0.0, Vectors.dense(10, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(12, 2.0)),
  LabeledPoint(0.0, Vectors.dense(13, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(12, 2.0)),
  LabeledPoint(1.0, Vectors.dense(12, 2.0))
).toDF()

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setMaxIter(20)
  .setRegParam(0)

val model = glr.fit(datasetPoissonLogWithZero)
The issue is in the initialization: the mean is initialized to the response, which can be zero. Applying the log link then yields a very negative value (protected against -Inf), which in turn produces near-zero probabilities and weights in the weighted least squares step. The fix is simple: perturb the initial mean by a small constant, as in the initialize method below.
override def initialize(y: Double, weight: Double): Double = {
require(y >= 0.0, "The response variable of Poisson family " +
s"should be non-negative, but got $y")
  y + 0.1  // perturb the initial mean away from zero so that log(mu) stays finite
}
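To illustrate the failure mode numerically, here is a minimal standalone sketch (plain Scala, not Spark internals; the 1e-16 floor is only a hypothetical stand-in for whatever protection the log link applies against -Inf). For the log link, mu = exp(eta) and the IRLS working weight is proportional to mu, so initializing the mean at a zero response drives the weight to essentially zero, while the proposed offset keeps it well away from zero.

// Minimal sketch of why a zero response breaks the Poisson/log-link initialization.
// For the log link, mu = exp(eta) and the IRLS working weight is proportional to mu.
object PoissonInitSketch extends App {
  val y = 0.0                                // a zero-valued response

  // Current behaviour: the mean is initialized to the response itself.
  val mu0  = y
  val eta0 = math.log(math.max(mu0, 1e-16))  // hypothetical floor standing in for the -Inf protection
  val w0   = math.exp(eta0)                  // working weight ~ 1e-16, effectively zero
  println(s"current init:  eta = $eta0, weight = $w0")

  // Proposed fix: perturb the initial mean away from zero.
  val mu1  = y + 0.1
  val eta1 = math.log(mu1)                   // about -2.3, perfectly finite
  val w1   = math.exp(eta1)                  // working weight = 0.1, well-conditioned
  println(s"proposed init: eta = $eta1, weight = $w1")
}

With the perturbed initialization, the first weighted least squares iteration starts from non-degenerate weights, which is exactly what the one-line change above provides.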
I already have a fix and test code. Will create a PR.