[FLINK-21030] Broken job restart for job with disjoint graph - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.11.2
Fix Version/s: 1.11.4, 1.12.2, 1.13.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

This builds on two other bugs, https://issues.apache.org/jira/browse/FLINK-21028 and https://issues.apache.org/jira/browse/FLINK-21029:

I tried to stop a Flink application on YARN via savepoint, which did not succeed due to a possible bug or race condition during shutdown (FLINK-21028). For some reason, Flink then attempted to restart the pipeline after the failed shutdown (FLINK-21029). The bug reported here:

As I mentioned, my job graph is disjoint and the pipelines are fully isolated. Let's say the original error occurred in a single task of pipeline1. Flink then restarted all of pipeline1, but pipeline2 was shut down successfully and switched to state FINISHED.

My job was therefore left in an invalid state after the stop attempt: one of the two pipelines was running, the other was FINISHED. I guess this is a bug in the restart behavior: only the connected component of the graph that contains the failed task is restarted, while the other components are not.
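To make the reported state easier to picture, here is a minimal sketch of how a restart scoped to one connected component plays out on a disjoint graph. It is only an illustration, not Flink's actual failover code: the task names, the edge list, and the helpers `connected_components` and `tasks_to_restart` are all hypothetical.

```python
from collections import defaultdict


def connected_components(vertices, edges):
    """Group vertices into connected components, treating edges as undirected."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited vertex.
        component = set()
        stack = [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(neighbors[v] - component)
        seen |= component
        components.append(component)
    return components


def tasks_to_restart(failed_task, vertices, edges):
    """Return the component containing the failed task (hypothetical helper)."""
    for component in connected_components(vertices, edges):
        if failed_task in component:
            return component
    raise ValueError(f"unknown task: {failed_task}")


# Hypothetical job graph with two fully isolated pipelines.
vertices = ["source1", "map1", "sink1", "source2", "map2", "sink2"]
edges = [("source1", "map1"), ("map1", "sink1"),
         ("source2", "map2"), ("map2", "sink2")]

# A failure in pipeline1 only selects pipeline1's tasks for restart.
restarted = tasks_to_restart("map1", vertices, edges)
# pipeline2 is not part of the restart and keeps whatever state it reached.
untouched = set(vertices) - restarted
```

With a failure in `map1`, `restarted` holds only the pipeline1 tasks and `untouched` holds the pipeline2 tasks, which matches the mixed RUNNING/FINISHED state described above.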

Issue Links

is related to
- FLINK-21029 Failure of shutdown lead to restart of (connected) pipeline (Open)
- FLINK-21028 Streaming application didn't stop properly (Closed)

relates to
- FLINK-17170 Cannot stop streaming job with savepoint which uses kinesis consumer (Resolved)

links to
- GitHub Pull Request #14847
- GitHub Pull Request #15034
- GitHub Pull Request #15035

People

Assignee: Matthias Pohl
Reporter: Theo Diefenthal
Votes: 0
Watchers: 12

Dates

Created: 19/Jan/21 10:55
Updated: 26/Feb/21 13:50
Resolved: 26/Feb/21 13:49