[FLINK-31509] REST Service missing sessionAffinity causes job run failure with HA cluster - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Deployment / Kubernetes
Labels: None
Environment:

Flink 1.15 on Flink Operator 1.4.0 on Kubernetes 1.25.4, (optionally with Beam 2.46.0)

but the issue was observed on Flink 1.14, 1.15 and 1.16 and on Flink Operator 1.2, 1.3, 1.3.1, 1.4.0

Description

When using a Session Cluster with multiple JobManagers, the `-rest` Service load-balances API requests across all JobManagers, not just the leader.

When submitting a FlinkSessionJob, I often see errors like: `jar <jar_id>.jar was not found`, because the submission is done in 2 steps:

1. Upload the jar with `v1/jars/upload`, which returns the `jar_id`.
2. Run the job with `v1/jars/<jar_id>/run`.

Unfortunately, because the Service load-balances between nodes, the jar is often uploaded to one JM while the run request lands on another, where the jar doesn't exist.
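To illustrate the failure mode, here is a minimal toy simulation in Python. It does not talk to Flink or Kubernetes. The `JobManager`, `RoundRobinService`, and `submit` names are hypothetical, and strict round-robin routing is an assumption made to keep the outcome deterministic; a real Service may pick backends differently.

```python
import itertools
import uuid


class JobManager:
    """Toy stand-in for one JobManager that keeps uploaded jars locally."""

    def __init__(self, name):
        self.name = name
        self.jars = set()

    def upload(self):
        # Mimics step 1: store the jar on this node only and return its id.
        jar_id = f"{uuid.uuid4()}_job.jar"
        self.jars.add(jar_id)
        return jar_id

    def run(self, jar_id):
        # Mimics step 2: fails if the jar was uploaded to a different node.
        if jar_id not in self.jars:
            raise FileNotFoundError(f"jar {jar_id} was not found on {self.name}")
        return f"job started on {self.name}"


class RoundRobinService:
    """Toy Service without session affinity (round-robin is an assumption)."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def route(self):
        return next(self._cycle)


def submit(service):
    # Each request is routed independently, as two separate REST calls would be.
    jar_id = service.route().upload()
    return service.route().run(jar_id)


def demo():
    service = RoundRobinService([JobManager("jm-0"), JobManager("jm-1")])
    try:
        return submit(service)
    except FileNotFoundError as err:
        return f"submission failed: {err}"


if __name__ == "__main__":
    print(demo())
```

In this sketch the upload lands on `jm-0` and the run request on `jm-1`, so the submission fails with the same kind of "was not found" error described above.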

A simple fix is to set `sessionAffinity: ClientIP` on the -rest Service, so that API calls from a given originating IP are always routed to the same node.
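As a rough sketch of applying this by hand, the Python snippet below builds a merge patch that sets `spec.sessionAffinity` and prints a matching `kubectl patch` command. The Service name `my-session-cluster-rest` and the namespace `flink` are placeholders, not values from this report, and a manual patch may be lost if the Service is recreated.

```python
import json
import shlex


def session_affinity_patch():
    """Merge-patch body that turns on client-IP session affinity."""
    return {"spec": {"sessionAffinity": "ClientIP"}}


def kubectl_patch_command(service_name, namespace):
    """Build a shell-quoted `kubectl patch` command for the given Service."""
    patch = json.dumps(session_affinity_patch())
    parts = [
        "kubectl", "patch", "service", service_name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", patch,
    ]
    return " ".join(shlex.quote(part) for part in parts)


if __name__ == "__main__":
    # Placeholder names; substitute the actual -rest Service and namespace.
    print(kubectl_patch_command("my-session-cluster-rest", "flink"))
```

The command is only printed, not executed, so it can be reviewed before being run against a cluster.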

This issue is especially problematic with Beam: the Beam job submission does not retry the run request with the same jar_id. Instead it fails, re-uploads a new jar, and tries again, until it is lucky enough to have the 2 calls in a row routed to the same node.

People

Assignee: Unassigned

Reporter: Emmanuel Leroy

Votes: 0

Watchers: 5

Dates

Created: 17/Mar/23 22:35

Updated: 31/Aug/23 14:17