[FLINK-34336] AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.19.0, 1.20.0
Fix Version/s: 1.19.0
Component/s: Tests
Labels:
- pull-request-available
- test-stability

Description

AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in the call waitForRunningTasks(restClusterClient, jobID, parallelism2).

Reason:

The job has two tasks (vertices). After updateJobResourceRequirements is called, the source parallelism is unchanged (it stays at parallelism), while the FlatMapper+Sink vertex changes from parallelism to parallelism2.

So the expected number of running tasks is parallelism + parallelism2, not parallelism2.
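
The arithmetic above can be sketched in a few lines. This is only an illustration, not code from the test: the class name, the method name, and the values 2 and 4 are assumptions chosen for the example, and the vertex names are taken from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExpectedRunningTasks {

    /** Each vertex runs one subtask per unit of parallelism, so the total is the sum. */
    static int expectedRunningTasks(Map<String, Integer> parallelismPerVertex) {
        return parallelismPerVertex.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        int parallelism = 2;  // hypothetical value, not taken from the test
        int parallelism2 = 4; // hypothetical value, not taken from the test

        // Parallelism per vertex after updateJobResourceRequirements has taken effect.
        Map<String, Integer> afterRescale = new LinkedHashMap<>();
        afterRescale.put("Source", parallelism);           // unchanged
        afterRescale.put("FlatMapper+Sink", parallelism2); // rescaled

        System.out.println("expected running tasks: " + expectedRunningTasks(afterRescale));
        System.out.println("value the test waits for: " + parallelism2);
    }
}
```

With these assumed values the two printed numbers differ, which is the mismatch described above.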

Why does the test pass for now?

Flink 1.19 supports a scaling cooldown, and the cooldown time is 30s by default. This means the Flink job is rescaled 30 seconds after updateJobResourceRequirements is called.

So the tasks are still running with the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2).

IIUC, this timing cannot be guaranteed, and passing for this reason is unintended.
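
The timing dependence can be illustrated with a toy model. This is a deliberate simplification of the behaviour described above, not Flink's scheduler logic: the class, its methods, and the values 2 and 4 are assumptions made for the sketch, and the only figure taken from the description is the 30s cooldown.

```java
class CooldownModel {

    /** Toy model: the new parallelism only takes effect once the cooldown has elapsed. */
    static int flatMapperParallelism(long elapsedMs, long cooldownMs, int oldP, int newP) {
        return elapsedMs < cooldownMs ? oldP : newP;
    }

    /** Running tasks = source subtasks (unchanged) + FlatMapper+Sink subtasks. */
    static int runningTasks(long elapsedMs, long cooldownMs, int parallelism, int parallelism2) {
        return parallelism
                + flatMapperParallelism(elapsedMs, cooldownMs, parallelism, parallelism2);
    }

    public static void main(String[] args) {
        int parallelism = 2;      // hypothetical value
        int parallelism2 = 4;     // hypothetical value
        long cooldownMs = 30_000; // 30s default, as stated in the description

        // Check the wait condition shortly after the update and again after the cooldown.
        for (long elapsedMs : new long[] {1_000, 31_000}) {
            int running = runningTasks(elapsedMs, cooldownMs, parallelism, parallelism2);
            System.out.printf(
                    "elapsed=%dms running=%d equalsParallelism2=%b%n",
                    elapsedMs, running, running == parallelism2);
        }
    }
}
```

Under these assumed values the check against parallelism2 succeeds only while the old parallelism is still in effect. Whether the real test passes for the same numeric reason depends on the actual values it uses.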

How to reproduce this bug?

https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6

- Disable the cooldown
- Sleep for a while before waitForRunningTasks

With these two changes, the job is already running with the new parallelism, so `waitForRunningTasks` hangs forever.
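
To make the hang concrete, here is a minimal sketch of a polling wait with a deadline, so that a condition that can never hold shows up as a failed wait rather than an endless loop. It is not the helper used by the test and not the fix applied in the linked pull requests: `BoundedWait`, `waitUntil`, and the values 2 and 4 are all hypothetical.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

class BoundedWait {

    /** Polls the condition until it holds or the timeout expires; returns whether it held. */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (deadline - System.nanoTime() > 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // One last check once the deadline has passed.
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        int parallelism = 2;  // hypothetical value
        int parallelism2 = 4; // hypothetical value
        // Running tasks once the job runs with the new parallelism.
        int runningAfterRescale = parallelism + parallelism2;

        boolean waitingForParallelism2 =
                waitUntil(
                        () -> runningAfterRescale == parallelism2,
                        Duration.ofMillis(200),
                        Duration.ofMillis(20));
        boolean waitingForSum =
                waitUntil(
                        () -> runningAfterRescale == parallelism + parallelism2,
                        Duration.ofMillis(200),
                        Duration.ofMillis(20));

        System.out.println("wait for parallelism2 succeeded: " + waitingForParallelism2);
        System.out.println("wait for parallelism + parallelism2 succeeded: " + waitingForSum);
    }
}
```

In this sketch the first wait times out because the condition never becomes true after the rescale, which is the situation that leaves an unbounded wait hanging.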

Issue Links

is caused by

FLINK-33246 Add RescalingIT case that uses checkpoints and resource requests

Resolved

links to

GitHub Pull Request #24248

GitHub Pull Request #24340

People

Assignee: Rui Fan

Reporter: Rui Fan

Dates

Created: 02/Feb/24 05:00

Updated: 20/Feb/24 15:00

Resolved: 20/Feb/24 14:22