Flink / FLINK-35761

Speed up the restore process of unaligned checkpoint

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.20.0, 1.19.1
    • None
    • None

    Description

      Currently, when a job restores from an unaligned checkpoint, a task transitions from ExecutionState.INITIALIZING to ExecutionState.RUNNING only after all recovered input buffers have been processed.

      As a result, the restore can take a very long time if the job processes data slowly and the unaligned checkpoint captured a large number of input buffers. In my experience, the restore time can exceed 30 minutes for jobs with high parallelism.

      We would like the job to switch to RUNNING as soon as possible, because a new checkpoint cannot be triggered while tasks are INITIALIZING. Once the job is RUNNING, a new unaligned checkpoint can be taken.

      Solution:

      In brief:

      1. The task switches to RUNNING once all input buffers have been added to RecoveredInputChannel.
        • In general, this is quick unless there are not enough network buffers.
        • When there are not enough network buffers, the task still has to wait for some buffers to be released. (Buffers are released after part of the data has been processed.)
      2. RecoveredInputChannel supports snapshotting its network buffers, so that a new checkpoint can include recovered buffers that have not been processed yet.
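
      The two steps above can be illustrated with a small toy model. This is only a sketch of the proposed behavior, not Flink code: ToyRecoveredChannel, ToyTask, and all of their methods are hypothetical names invented for illustration, and the real RecoveredInputChannel and task state machine are considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of the proposal; none of these types are Flink classes. */
final class RestoreSketch {

    enum State { INITIALIZING, RUNNING }

    /** Hypothetical stand-in for a channel that holds buffers restored from a checkpoint. */
    static final class ToyRecoveredChannel {
        private final Queue<byte[]> pending = new ArrayDeque<>();
        private boolean allBuffersAdded;

        void addRecoveredBuffer(byte[] buffer) { pending.add(buffer); }

        /** Called once the last recovered buffer has been read from the checkpoint. */
        void finishReadingRecoveredState() { allBuffersAdded = true; }

        boolean allBuffersAdded() { return allBuffersAdded; }

        /** Processes one buffer; returns false when nothing is left. */
        boolean processNext() { return pending.poll() != null; }

        /** Step 2: buffers that are not yet processed are part of the next snapshot. */
        List<byte[]> snapshotPendingBuffers() { return new ArrayList<>(pending); }
    }

    static final class ToyTask {
        private final ToyRecoveredChannel channel;
        private State state = State.INITIALIZING;

        ToyTask(ToyRecoveredChannel channel) { this.channel = channel; }

        /** Step 1: switch once buffers are queued, not once they are processed. */
        void maybeSwitchToRunning() {
            if (state == State.INITIALIZING && channel.allBuffersAdded()) {
                state = State.RUNNING;
            }
        }

        State state() { return state; }

        /** In this model a checkpoint can only be taken while RUNNING. */
        List<byte[]> triggerCheckpoint() {
            if (state != State.RUNNING) {
                throw new IllegalStateException("checkpoint not possible during INITIALIZING");
            }
            return channel.snapshotPendingBuffers();
        }
    }

    public static void main(String[] args) {
        ToyRecoveredChannel channel = new ToyRecoveredChannel();
        ToyTask task = new ToyTask(channel);
        for (int i = 0; i < 3; i++) {
            channel.addRecoveredBuffer(new byte[] {(byte) i});
        }
        channel.finishReadingRecoveredState();
        task.maybeSwitchToRunning();
        channel.processNext(); // only one of the three buffers has been processed
        // prints "RUNNING, buffers in snapshot: 2"
        System.out.println(task.state() + ", buffers in snapshot: " + task.triggerCheckpoint().size());
    }
}
```

      In this model the task reaches RUNNING while two recovered buffers are still unprocessed, which is why the channel itself has to be able to contribute those buffers to the next snapshot.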

       

      Additional improvement:

      • RecoveredInputChannel currently requests only exclusive buffers and does not request floating buffers.
      • As a result, RecoveredInputChannel may not have enough network buffers if the job that created the checkpoint was using floating buffers.
      • If this optimization makes sense, RecoveredInputChannel could be extended to request floating buffers in a separate Jira issue.
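
      A rough way to see the buffer shortfall described above is the arithmetic below. It is a sketch under assumed numbers: BufferBudgetSketch and buffersThatMustWait are hypothetical helpers, and the buffer counts are illustrative rather than taken from any real job.

```java
/** Toy arithmetic for the exclusive vs. floating buffer budget; not Flink code. */
final class BufferBudgetSketch {

    /**
     * Returns how many recovered buffers cannot be loaded right away and must wait
     * until earlier buffers have been processed and released.
     */
    static int buffersThatMustWait(
            int recoveredBuffers, int exclusiveBuffers, int floatingBuffers, boolean useFloating) {
        int capacity = exclusiveBuffers + (useFloating ? floatingBuffers : 0);
        return Math.max(0, recoveredBuffers - capacity);
    }

    public static void main(String[] args) {
        // Assumed scenario: the old job held 2 exclusive + 8 floating buffers on one
        // channel, so the checkpoint contains 10 buffers for that channel.
        int recovered = 10;
        System.out.println(buffersThatMustWait(recovered, 2, 8, false)); // prints 8
        System.out.println(buffersThatMustWait(recovered, 2, 8, true));  // prints 0
    }
}
```

      With only exclusive buffers, most of the recovered data in this scenario has to wait for memory to be released, which delays the point at which all buffers have been added to the channel.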

          People

            Assignee: Rui Fan (fanrui)
            Reporter: Rui Fan (fanrui)
            Votes: 0
            Watchers: 4
