Flink / FLINK-35433

Provide a config parameter to set {{publishNotReadyAddresses}} option for the jobmanager's RPC Kubernetes service

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved

    Description

      Context:
      In a native Kubernetes deployment, Flink creates a headless service for the JobManager's RPC calls. The description below applies only to Flink deployments in Application mode.

      When a livenessProbe and/or readinessProbe is defined with initialDelaySeconds, newly created TaskManager instances have to wait until the JobManager's probes are green before they can connect to the JobManager.

      Probes configuration:

      - name: flink-main-container
        livenessProbe:               
          httpGet:                 
            path: /jobs/overview
            port: rest
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 6
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe: 
          httpGet: 
            path: /jobs/overview
            port: rest
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 6
          successThreshold: 1
          timeoutSeconds: 5
      

      During this period, the TaskManager logs messages like:

      Failed to connect to [dev-pipeline.dev-namespace:6123] from local address [dev-pipeline-taskmanager-1-1/11.41.6.81] with timeout [200] due to: dev-pipeline.dev-namespace
      

       

      Issue:
      Because the initialization time of different Flink jobs (read: Flink deployments) varies widely, it would be convenient to have one common livenessProbe and/or readinessProbe configuration for all deployments that covers the worst case, instead of tuning it for every deployment. On the other hand, it would be nice to reduce the job's overall bootstrap time, because in our case jobs are redeployed often and this affects the response time of incoming client requests.

       

      Solution:
      To reduce the job's overall bootstrap time, one solution could be a config parameter that sets the publishNotReadyAddresses flag on the JobManager's RPC Kubernetes service, so that a newly created TaskManager can connect to the JobManager immediately.
      Publishing a "not ready" JobManager RPC address should not cause any issues: in a native Kubernetes deployment the TaskManager instances are created by the ResourceManager, which is part of the JobManager. This in turn guarantees that the JobManager is ready and the ExecutionGraph was built successfully by the time a TaskManager starts.
      Making the flag optional guarantees that the existing approach keeps working when the flag is disabled and JobManager High Availability is configured, which in turn involves leader election.
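
      To illustrate the effect of the flag, here is a toy model (not Kubernetes or Flink code) of which addresses a headless service exposes. The class PublishNotReadySketch, the Pod type and the IP address are hypothetical names introduced only for this sketch. It assumes the documented Kubernetes behavior that endpoints of not-ready pods are withheld unless publishNotReadyAddresses is set.

```java
import java.util.ArrayList;
import java.util.List;

public class PublishNotReadySketch {

    /** Hypothetical stand-in for a pod backing the service; not a Kubernetes client type. */
    public static final class Pod {
        final String ip;
        final boolean ready;

        public Pod(String ip, boolean ready) {
            this.ip = ip;
            this.ready = ready;
        }
    }

    /**
     * Toy model: a pod's address is exposed if the pod is ready,
     * or unconditionally when publishNotReadyAddresses is true.
     */
    public static List<String> resolvableAddresses(List<Pod> pods, boolean publishNotReadyAddresses) {
        List<String> addresses = new ArrayList<>();
        for (Pod pod : pods) {
            if (pod.ready || publishNotReadyAddresses) {
                addresses.add(pod.ip);
            }
        }
        return addresses;
    }

    public static void main(String[] args) {
        // A JobManager pod whose readinessProbe has not passed yet,
        // for example because it is still inside initialDelaySeconds.
        List<Pod> pods = List.of(new Pod("10.0.0.5", false));

        System.out.println(resolvableAddresses(pods, false)); // prints []
        System.out.println(resolvableAddresses(pods, true));  // prints [10.0.0.5]
    }
}
```

      With the flag off, a TaskManager has nothing to resolve until the readiness probe passes. With the flag on, the address is available as soon as the pod has an IP.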

       

      Affected Classes:

      • org.apache.flink.kubernetes.kubeclient.services.HeadlessClusterIPService - by adding one line, {{.withPublishNotReadyAddresses(kubernetesJobManagerParameters.isPublishNotReadyAddresses())}}, in {{Service buildUpInternalService(KubernetesJobManagerParameters kubernetesJobManagerParameters)}}
      • org.apache.flink.kubernetes.configuration.KubernetesConfigOptions - by adding an option such as {{kubernetes.jobmanager.rpc.service.publish-not-ready-addresses}}
      • org.apache.flink.kubernetes.kubeclient.parameters.KubernetesJobManagerParameters - by adding a getter for the parameter: {{public boolean isPublishNotReadyAddresses() { return flinkConfig.getBoolean(KubernetesConfigOptions.KUBERNETES_JOBMANAGER_RPC_SERVICE_PUBLISH_NOT_READY_ADDRESSES); } }}
      • Tests to cover the new parameter
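
      The changes listed above could fit together roughly as follows. This is a self-contained sketch with stand-in types, not the actual Flink or fabric8 classes: JobManagerParameters and ServiceSpec are hypothetical simplifications, the configuration is a plain map, and the default of false is an assumption chosen so that existing deployments keep their current behavior.

```java
import java.util.HashMap;
import java.util.Map;

public class PublishNotReadyOptionSketch {

    // Option key as proposed in this issue.
    public static final String KEY = "kubernetes.jobmanager.rpc.service.publish-not-ready-addresses";
    // Assumed default: off, so nothing changes unless the user opts in.
    public static final boolean DEFAULT = false;

    /** Stand-in for KubernetesJobManagerParameters, backed by a plain map. */
    public static final class JobManagerParameters {
        private final Map<String, String> flinkConfig;

        public JobManagerParameters(Map<String, String> flinkConfig) {
            this.flinkConfig = flinkConfig;
        }

        public boolean isPublishNotReadyAddresses() {
            String raw = flinkConfig.get(KEY);
            return raw == null ? DEFAULT : Boolean.parseBoolean(raw.trim());
        }
    }

    /** Stand-in for the relevant part of a Kubernetes service spec. */
    public static final class ServiceSpec {
        public final String clusterIP;
        public final boolean publishNotReadyAddresses;

        ServiceSpec(String clusterIP, boolean publishNotReadyAddresses) {
            this.clusterIP = clusterIP;
            this.publishNotReadyAddresses = publishNotReadyAddresses;
        }
    }

    /** Mirrors the idea of buildUpInternalService: a headless service that takes the flag from the parameters. */
    public static ServiceSpec buildUpInternalService(JobManagerParameters parameters) {
        // clusterIP "None" is what makes a Kubernetes service headless.
        return new ServiceSpec("None", parameters.isPublishNotReadyAddresses());
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        System.out.println(buildUpInternalService(new JobManagerParameters(config)).publishNotReadyAddresses); // prints false

        config.put(KEY, "true");
        System.out.println(buildUpInternalService(new JobManagerParameters(config)).publishNotReadyAddresses); // prints true
    }
}
```

      Tests for the new parameter could follow the same two cases: option absent (default applies) and option explicitly set to true.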

      If it is decided that this improvement is worth including in Flink, I am ready to provide a PR for it.

          People

            Assignee: Unassigned
            Reporter: Pavel (dantalian_pv)