[FLINK-35285] Autoscaler key group optimization can interfere with scale-down.max-factor - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Kubernetes Operator
Labels: None

Description

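When a less aggressive scale-down limit is set, the key group optimization can prevent a vertex from scaling down at all. The optimization searches upward from the target parallelism to maxParallelism/2 for a divisor of maxParallelism. When currentParallelism is itself a divisor and no other divisor lies between the target and currentParallelism, the search ends at currentParallelism again.

A simple test that tries to scale down from a parallelism of 60 with a scale-down.max-factor of 0.2 (a scale factor of 0.8 and a maxParallelism of 360) fails, because the result is 60 rather than 48: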

assertEquals(48, JobVertexScaler.scale(60, inputShipStrategies, 360, .8, 8, 360));
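
The search described above can be reproduced in isolation. The following is a minimal sketch, not the operator's actual code: the class and method names are hypothetical, the shuffle-strategy check is left out, and the clamping to the lower and upper limits is an assumption about how the target is derived before the loop runs.

```java
public class KeyGroupAlignmentRepro {

    // Hypothetical, simplified stand-in for the key group optimization.
    public static int alignToKeyGroups(
            int currentParallelism,
            int maxParallelism,
            double scaleFactor,
            int lowerLimit,
            int upperLimit) {
        int upperBound = Math.min(maxParallelism, upperLimit);

        // Target parallelism before alignment, clamped to the limits (assumed).
        int target = (int) Math.ceil(scaleFactor * currentParallelism);
        target = Math.min(Math.max(lowerLimit, target), upperBound);

        // Search upward for the first divisor of maxParallelism.
        // Nothing here stops the search from reaching currentParallelism.
        for (int p = target; p <= maxParallelism / 2 && p <= upperBound; p++) {
            if (maxParallelism % p == 0) {
                return p;
            }
        }
        // No divisor found: fall back to the unaligned target.
        return target;
    }

    public static void main(String[] args) {
        // Target is 48; no value in 48..59 divides 360, so the search stops at 60.
        System.out.println(alignToKeyGroups(60, 360, 0.8, 8, 360));
    }
}
```

With these inputs the candidates 48 through 59 are all rejected and the first accepted value is the current parallelism, so the vertex never shrinks.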

It seems reasonable to make a good attempt to spread data evenly across subtasks, but not at the expense of total deadlock. The problem is that during scale down the key group optimization does not ensure that newParallelism < currentParallelism. The only workaround is to set a scale-down factor large enough that the target reaches the next lowest divisor of maxParallelism.

A change like the following is clunky, but it would ensure that the scaler makes at least some progress. Another existing test fails with this change, so it is only meant to illustrate the point:
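
To make the workaround concrete, here is a rough sketch that computes how large scale-down.max-factor would need to be for a given vertex. The class and method names are hypothetical, and it assumes the lowest allowed scale factor is 1 minus scale-down.max-factor, as in the example above (0.2 giving 0.8).

```java
public class ScaleDownWorkaround {

    // Largest divisor of maxParallelism that is strictly below currentParallelism.
    public static int nextLowerDivisor(int maxParallelism, int currentParallelism) {
        if (currentParallelism <= 1) {
            throw new IllegalArgumentException("cannot scale below a parallelism of 1");
        }
        for (int p = currentParallelism - 1; p >= 1; p--) {
            if (maxParallelism % p == 0) {
                return p; // p == 1 always divides, so the loop always returns
            }
        }
        throw new IllegalStateException("unreachable");
    }

    // Smallest scale-down.max-factor that lets the target reach that divisor
    // (assumes minimum scale factor = 1 - max-factor).
    public static double requiredMaxFactor(int maxParallelism, int currentParallelism) {
        int divisor = nextLowerDivisor(maxParallelism, currentParallelism);
        return 1.0 - (double) divisor / currentParallelism;
    }

    public static void main(String[] args) {
        System.out.println(nextLowerDivisor(360, 60));   // 45
        System.out.println(requiredMaxFactor(360, 60));  // 0.25
    }
}
```

For a parallelism of 60 and a maxParallelism of 360, the next lower divisor is 45, so the factor would have to be at least 0.25 before any scale down can happen.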

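for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
    // Only consider candidates that move in the requested direction:
    // below currentParallelism when scaling down, above it when scaling up.
    if ((scaleFactor < 1 && p < currentParallelism) || (scaleFactor > 1 && p > currentParallelism)) {
        // Accept the first candidate that divides maxParallelism evenly.
        if (maxParallelism % p == 0) {
            return p;
        }
    }
}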
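
For experimenting with this guard outside the operator, the loop can be wrapped in a small standalone method. This is only a sketch under stated assumptions: the class name is hypothetical, the clamping before the loop is assumed, and the fall-through to the unaligned target when no candidate qualifies is assumed as well (it is what makes the test above return 48).

```java
public class PatchedKeyGroupAlignment {

    // Hypothetical standalone version of the guarded search.
    public static int scale(
            int currentParallelism,
            int maxParallelism,
            double scaleFactor,
            int lowerLimit,
            int upperLimit) {
        int upperBound = Math.min(maxParallelism, upperLimit);

        // Target parallelism before alignment, clamped to the limits (assumed).
        int newParallelism = (int) Math.ceil(scaleFactor * currentParallelism);
        newParallelism = Math.min(Math.max(lowerLimit, newParallelism), upperBound);

        for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
            boolean makesProgress =
                    (scaleFactor < 1 && p < currentParallelism)
                            || (scaleFactor > 1 && p > currentParallelism);
            if (makesProgress && maxParallelism % p == 0) {
                return p;
            }
        }
        // No aligned candidate in the requested direction: keep the unaligned target.
        return newParallelism;
    }

    public static void main(String[] args) {
        System.out.println(scale(60, 360, 0.8, 8, 360));  // 48
        System.out.println(scale(60, 360, 0.75, 8, 360)); // 45
    }
}
```

The trade-off is visible in the first call: 48 does not divide 360, so key groups are spread unevenly across subtasks, but the vertex does scale down.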

Perhaps this is by design and not a bug, but total failure to scale down in order to keep optimized key groups does not seem ideal.

Key group optimization block:

https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L296C1-L303C10

People

Assignee: Unassigned

Reporter: Trystan

Votes: 0

Watchers: 3

Dates

Created: 02/May/24 16:50

Updated: 25/Jul/24 19:29