Beam / BEAM-14332

Improve the workflow of cluster management for Flink on Dataproc

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.40.0
    • Component/s: runner-py-interactive
    • Labels: None

    Description

      Improve the workflow of cluster management.

      There is an option to configure a default cluster name. The existing user flows are:

      1. Use the default cluster name to create a new cluster if none is in use;
      2. Reuse a created cluster that has the default cluster name;
      3. If the default cluster name is configured to a new value, re-apply 1 and 2.
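      The name-keyed reuse flow above can be modeled with a short sketch. All names here (`ClusterManager`, `get_or_create`) are hypothetical illustrations, not the actual interactive runner API:

```python
# Hypothetical model of the existing flow: clusters are keyed by a
# user-configurable default name, so reuse depends entirely on that name.
class ClusterManager:
    def __init__(self):
        self._clusters = {}  # cluster name -> cluster metadata

    def get_or_create(self, default_name):
        # Flow 2: reuse a created cluster that has the default name.
        if default_name in self._clusters:
            return self._clusters[default_name]
        # Flow 1: create a new cluster under the default name.
        cluster = {'name': default_name, 'workers': 3}
        self._clusters[default_name] = cluster
        return cluster

mgr = ClusterManager()
a = mgr.get_or_create('interactive-beam-cluster')
b = mgr.get_or_create('interactive-beam-cluster')
assert a is b      # same name -> reused cluster
# Flow 3: a new default name re-applies flows 1 and 2.
c = mgr.get_or_create('another-cluster')
assert c is not a  # new name -> new cluster
```

      The fragility is visible in the sketch: two notebook runtimes sharing one default name will collide on the same dictionary key.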

      A better solution is to:

      1. Create a new cluster implicitly if there is none or explicitly if the user wants one with specific provisioning;
      2. Always default to using the last created cluster.
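      The proposed flow can be sketched as follows. Again, the class and method names (`UniqueNameClusterManager`, `create_cluster`, `default_cluster`) are hypothetical, chosen only to illustrate the two rules above:

```python
import uuid

# Hypothetical sketch of the proposed flow: every cluster gets a unique,
# runtime-generated name, and the runtime defaults to the last created
# cluster instead of resolving by name.
class UniqueNameClusterManager:
    def __init__(self):
        self._clusters = {}
        self._last_created = None

    def create_cluster(self, workers=3):
        # The name is an implementation detail, not user input, so two
        # notebook runtimes can never race on the same name.
        name = 'interactive-beam-' + uuid.uuid4().hex[:8]
        cluster = {'name': name, 'workers': workers}
        self._clusters[name] = cluster
        self._last_created = cluster
        return cluster

    def default_cluster(self):
        # Rule 1: create implicitly if none exists.
        # Rule 2: otherwise reuse the last created cluster.
        return self._last_created or self.create_cluster()

mgr = UniqueNameClusterManager()
first = mgr.default_cluster()           # implicit creation
assert mgr.default_cluster() is first   # defaults to last created
second = mgr.create_cluster(workers=5)  # explicit provisioning
assert mgr.default_cluster() is second
assert first['name'] != second['name']  # names are always distinct
```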

      The reasons are:

      • The cluster name is meaningless to the user when a cluster is just a medium to run OSS runners (as applications) such as Flink or Spark. The cluster could also run anywhere on GCP, such as Dataproc, k8s, or even Dataflow itself.
      • Clusters should be uniquely identified and thus always have distinct names. The notebook runtime manages clusters (creates/reuses/deletes them) behind the scenes when the user doesn't do so explicitly (the capability to explicitly manage clusters remains available). Reusing the same default cluster name is risky: one notebook runtime may delete a cluster while another runtime creates a new cluster with the same name.
      • The user should still have the capability to explicitly provision a cluster.

      The current implementation provisions each cluster at the location specified by GoogleCloudOptions with 3 worker nodes. There is no explicit API to configure the number or shape of the workers.

      We could use WorkerOptions to allow customers to explicitly provision a cluster, and expose an explicit API (with UX in the notebook extension) for customers to change the size of the cluster connected to their notebook (until we have an autoscaling solution for Flink on Dataproc).
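      A minimal sketch of deriving the cluster shape from WorkerOptions (num_workers and machine_type are real WorkerOptions flags in Beam Python; the fallback machine type and the `cluster_shape` helper are assumptions for illustration):

```python
# Hypothetical sketch: size the cluster from the pipeline's WorkerOptions,
# falling back to the current hard-coded defaults when nothing is set.
def cluster_shape(worker_options):
    return {
        # Current behavior: 3 workers when nothing is specified.
        'num_workers': worker_options.get('num_workers') or 3,
        # Fallback machine type is illustrative, not a documented default.
        'machine_type': worker_options.get('machine_type') or 'n1-standard-4',
    }

# No explicit sizing -> current default of 3 workers.
assert cluster_shape({})['num_workers'] == 3

# User explicitly provisions via WorkerOptions.
shape = cluster_shape({'num_workers': 8, 'machine_type': 'n1-highmem-8'})
assert shape['num_workers'] == 8
assert shape['machine_type'] == 'n1-highmem-8'
```

      An explicit resize API surfaced in the notebook extension could then update num_workers on the connected cluster, serving as a stopgap until autoscaling is available.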


              People

                Assignee: Ning (ningk)
                Reporter: Ning (ningk)


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 20m