  1. Flink
  2. FLINK-33137 FLIP-312: Prometheus Sink Connector
  3. FLINK-36319

FAIL behavior on non-retriable write errors causes an infinite loop when restarting from checkpoint

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed

    Description

      With the FAIL (default) error handling behavior, a write request rejected as non-retriable (onPrometheusNonRetriableError) causes the job to fail and restart.

      Restarting from a checkpoint causes some out-of-order (duplicate) writes, which Prometheus rejects as non-retriable by default.

      As a consequence, when onPrometheusNonRetriableError = FAIL, any restart from a checkpoint puts the job in an infinite restart loop.
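The loop described above can be reproduced with a toy model. The sketch below is not connector code: `RestartLoopSketch`, `ToyPrometheus`, `replayFromCheckpoint` and the restart cap are hypothetical names used only for illustration. It assumes the backend rejects any sample whose timestamp is not newer than the last accepted one, and that the job had already written samples past the point covered by its last checkpoint.

```java
import java.util.List;

/** Toy model of the restart loop. Every name here is hypothetical, not connector API. */
final class RestartLoopSketch {

    enum Behavior { FAIL, DISCARD_AND_CONTINUE }

    /** Stand-in backend that rejects samples not newer than the last accepted one. */
    static final class ToyPrometheus {
        private long lastAccepted = Long.MIN_VALUE;

        boolean write(long timestamp) {
            if (timestamp <= lastAccepted) {
                return false; // out-of-order or duplicate: non-retriable rejection
            }
            lastAccepted = timestamp;
            return true;
        }
    }

    /**
     * Replays samples starting at checkpointIndex, restarting from the same
     * index whenever the job fails. Returns the number of extra restarts
     * needed to finish, or -1 if the job still fails after maxRestarts.
     */
    static int replayFromCheckpoint(ToyPrometheus backend, List<Long> timestamps,
                                    int checkpointIndex, Behavior behavior, int maxRestarts) {
        for (int restarts = 0; restarts <= maxRestarts; restarts++) {
            boolean failed = false;
            for (int i = checkpointIndex; i < timestamps.size(); i++) {
                boolean accepted = backend.write(timestamps.get(i));
                if (!accepted && behavior == Behavior.FAIL) {
                    failed = true; // job fails and restores the same checkpoint
                    break;
                }
                // DISCARD_AND_CONTINUE: drop the rejected sample and keep going
            }
            if (!failed) {
                return restarts;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Long> samples = List.of(1L, 2L, 3L, 4L, 5L);
        for (Behavior behavior : Behavior.values()) {
            ToyPrometheus backend = new ToyPrometheus();
            // Before the crash the job had already written samples 1..3,
            // but the last checkpoint only covers sample 1.
            for (long t : samples.subList(0, 3)) {
                backend.write(t);
            }
            int restarts = replayFromCheckpoint(backend, samples, 1, behavior, 10);
            System.out.println(behavior + " -> " + restarts);
        }
    }
}
```

In this model FAIL never completes (the method returns -1 for any restart cap), because every restore replays the same already-written sample first. DISCARD_AND_CONTINUE drops the two duplicates and finishes on the first attempt.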

      Changes:

      1. The default for onPrometheusNonRetriableError should be DISCARD_AND_CONTINUE
      2. onPrometheusNonRetriableError cannot be set to FAIL
      3. Amend the docs accordingly

      We can keep the rest of the implementation as-is for the moment and just prevent FAIL from being set for this behaviour, as we may later extend the handling of this error with a different behaviour.
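Changes 1 and 2 could be enforced at configuration time, roughly as follows. This is a minimal sketch under assumptions, not the connector's actual code: `ErrorHandlingSketch`, `OnErrorBehavior`, `Config` and `Builder` are hypothetical names, and only the option name `onPrometheusNonRetriableError` and the values FAIL and DISCARD_AND_CONTINUE come from this issue. FAIL stays in the enum so the rest of the implementation can be kept as-is.

```java
import java.util.Objects;

/** Hypothetical configuration holder; not the connector's real classes. */
final class ErrorHandlingSketch {

    /** FAIL is kept so other error types could still use it. */
    enum OnErrorBehavior { FAIL, DISCARD_AND_CONTINUE }

    static final class Config {
        private final OnErrorBehavior onPrometheusNonRetriableError;

        private Config(OnErrorBehavior onPrometheusNonRetriableError) {
            this.onPrometheusNonRetriableError = onPrometheusNonRetriableError;
        }

        OnErrorBehavior onPrometheusNonRetriableError() {
            return onPrometheusNonRetriableError;
        }

        static Builder builder() {
            return new Builder();
        }
    }

    static final class Builder {
        // Change 1: the default is DISCARD_AND_CONTINUE.
        private OnErrorBehavior onPrometheusNonRetriableError =
                OnErrorBehavior.DISCARD_AND_CONTINUE;

        Builder onPrometheusNonRetriableError(OnErrorBehavior behavior) {
            Objects.requireNonNull(behavior, "behavior");
            // Change 2: reject FAIL up front instead of looping at runtime.
            if (behavior == OnErrorBehavior.FAIL) {
                throw new IllegalArgumentException(
                        "onPrometheusNonRetriableError cannot be set to FAIL");
            }
            this.onPrometheusNonRetriableError = behavior;
            return this;
        }

        Config build() {
            return new Config(onPrometheusNonRetriableError);
        }
    }
}
```

Rejecting the value in the setter surfaces the problem when the sink is built, rather than after the first restore from a checkpoint.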

            People

              nicusX Lorenzo Nicora
              nicusX Lorenzo Nicora
              Votes: 0
              Watchers: 2
