[FLINK-36319] FAIL behavior on non-retriable write errors causes an infinite loop when restarting from checkpoint - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- pull-request-available

Description

The FAIL (default) error handling behavior, applied when a write request is rejected as non-retriable (onPrometheusNonRetriableError), causes the job to fail and restart.

Restarting from a checkpoint causes some out-of-order (duplicate) writes, which Prometheus rejects as non-retriable by default.

As a consequence, when onPrometheusNonRetriableError = FAIL, any restart from a checkpoint puts the job in an infinite loop: the replayed writes are rejected, the job fails again, and it restarts from the same checkpoint.

Changes:

1. default onPrometheusNonRetriableError should be DISCARD_AND_CONTINUE
2. onPrometheusNonRetriableError cannot be set to FAIL
3. Amend docs

We can keep the rest of the implementation as-is for the moment and simply prevent onPrometheusNonRetriableError from being set to FAIL, since we may later extend the handling of this error with a different behaviour.
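
A minimal sketch of changes 1 and 2, assuming a plain enum and a static validation helper. The class name, the enum name `OnErrorBehavior`, and the method `resolveOnPrometheusNonRetriableError` are hypothetical and chosen only for illustration; they are not taken from the connector's code. Only `onPrometheusNonRetriableError`, `FAIL` and `DISCARD_AND_CONTINUE` come from this ticket.

```java
/**
 * Illustrative sketch only: names other than FAIL, DISCARD_AND_CONTINUE and
 * onPrometheusNonRetriableError are assumptions, not the connector's API.
 */
public class NonRetriableErrorConfigSketch {

    /** The two behaviours mentioned in this ticket. */
    public enum OnErrorBehavior {
        DISCARD_AND_CONTINUE,
        FAIL
    }

    /** Change 1: the default for onPrometheusNonRetriableError becomes DISCARD_AND_CONTINUE. */
    public static final OnErrorBehavior DEFAULT_ON_NON_RETRIABLE =
            OnErrorBehavior.DISCARD_AND_CONTINUE;

    /**
     * Resolves the effective behaviour for non-retriable write errors.
     * A null value means "not configured" and falls back to the default.
     * Change 2: FAIL is rejected up front, because a restart from checkpoint
     * replays writes that Prometheus rejects again as non-retriable.
     */
    public static OnErrorBehavior resolveOnPrometheusNonRetriableError(OnErrorBehavior configured) {
        if (configured == null) {
            return DEFAULT_ON_NON_RETRIABLE;
        }
        if (configured == OnErrorBehavior.FAIL) {
            throw new IllegalArgumentException(
                    "onPrometheusNonRetriableError cannot be set to FAIL");
        }
        return configured;
    }

    public static void main(String[] args) {
        // Not configured: falls back to the default behaviour.
        System.out.println(resolveOnPrometheusNonRetriableError(null));

        // Explicit FAIL: rejected at configuration time rather than at runtime.
        try {
            resolveOnPrometheusNonRetriableError(OnErrorBehavior.FAIL);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Rejecting FAIL when the configuration is resolved, rather than removing the enum value, matches the intent stated above: the rest of the implementation stays untouched and a different behaviour can be added later.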

Issue Links

links to

GitHub Pull Request #7

People

Assignee: Lorenzo Nicora

Reporter: Lorenzo Nicora

Votes: 0

Watchers: 2

Dates

Created: 18/Sep/24 15:56

Updated: 24/Sep/24 15:49

Resolved: 24/Sep/24 15:49