Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
With the FAIL (default) error-handling behavior, a write request rejected as non-retriable (onPrometheusNonRetriableError) causes the job to fail and restart.
Restarting from a checkpoint causes some out-of-order (duplicate) writes, which Prometheus rejects as non-retriable by default.
As a consequence, when onPrometheusNonRetriableError = FAIL, any restart from a checkpoint puts the job in an infinite loop: the replayed writes are rejected, the job fails, and it restarts from the same checkpoint.
Changes:
1. default onPrometheusNonRetriableError should be DISCARD_AND_CONTINUE
2. onPrometheusNonRetriableError cannot be set to FAIL
3. Amend docs
We can keep the rest of the implementation as-is for the moment and just prevent FAIL from being set for this behaviour, as we may later extend the handling of this error with a different behaviour.
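The first two changes could look roughly like the sketch below. It is only an illustration: `OnErrorBehavior`, `ErrorHandlingConfig` and its builder are hypothetical names chosen to mirror the wording of this issue, and are not claimed to match the connector's actual classes. The FAIL value stays in the enum, in line with keeping the rest of the implementation as-is; only the setter rejects it.

```java
import java.util.Objects;

// Hypothetical names for illustration only; not the connector's actual API.
enum OnErrorBehavior {
    FAIL,
    DISCARD_AND_CONTINUE
}

final class ErrorHandlingConfig {
    private final OnErrorBehavior onPrometheusNonRetriableError;

    private ErrorHandlingConfig(OnErrorBehavior onPrometheusNonRetriableError) {
        this.onPrometheusNonRetriableError = onPrometheusNonRetriableError;
    }

    OnErrorBehavior onPrometheusNonRetriableError() {
        return onPrometheusNonRetriableError;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        // Change 1: the default is DISCARD_AND_CONTINUE instead of FAIL.
        private OnErrorBehavior onPrometheusNonRetriableError =
                OnErrorBehavior.DISCARD_AND_CONTINUE;

        Builder onPrometheusNonRetriableError(OnErrorBehavior behavior) {
            Objects.requireNonNull(behavior, "behavior must not be null");
            // Change 2: FAIL is rejected at configuration time, because a
            // restart from checkpoint would replay writes that Prometheus
            // rejects again, failing the job in a loop.
            if (behavior == OnErrorBehavior.FAIL) {
                throw new IllegalArgumentException(
                        "onPrometheusNonRetriableError cannot be set to FAIL");
            }
            this.onPrometheusNonRetriableError = behavior;
            return this;
        }

        ErrorHandlingConfig build() {
            return new ErrorHandlingConfig(onPrometheusNonRetriableError);
        }
    }
}
```

With this shape, `ErrorHandlingConfig.builder().build()` yields DISCARD_AND_CONTINUE, while passing `OnErrorBehavior.FAIL` to the setter throws an `IllegalArgumentException` before the job starts.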