Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-34336

AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes

    XMLWordPrintableJSON

Details

    Description

      AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2);

      Reason:

      The job has 2 tasks(vertices), after calling updateJobResourceRequirements. The source parallelism isn't changed (It's parallelism) , and the FlatMapper+Sink is changed from  parallelism to parallelism2.

      So we expect the task number should be parallelism + parallelism2 instead of parallelism2.

       

      Why it can be passed for now?

      Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means, flink job will rescale job 30 seconds after updateJobResourceRequirements is called.

       

      So the running tasks are old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);.

      IIUC, it cannot be guaranteed, and it's unexpected.

       

      How to reproduce this bug?

      https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6

      • Disable the cooldown
      • Sleep for a while before waitForRunningTasks

      If so, the job running in new parallelism, so `waitForRunningTasks` will hang forever.

      Attachments

        Issue Links

          Activity

            People

              fanrui Rui Fan
              fanrui Rui Fan
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: