Apache Gobblin / GOBBLIN-82

Dummy/noop workunits in gobblin

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

          Problem

      Gobblin needs a no-op `WorkUnit`: one that does not need a `Task`. It can be thought of as a pass-through workunit that exists only because the source wants to persist it in the Gobblin state store.

          Use Cases
            1. If a job has an exponential retry policy (base 2) for workunits, a workunit is retried only on attempts 2, 4, 8, and so on. The attempts in between are no-op attempts. For these attempts the source still needs to create a dummy workunit and update the backoff value so that the `WorkUnit` is persisted in the state store.
            2. When a dataset is blacklisted, the source has to create a dummy workunit carrying the previous high watermark so that Gobblin persists that watermark in the state store; otherwise the high watermark is lost. This can be solved using datasetUrns, where Gobblin automatically looks up the last available previous high watermark, but not all jobs use datasetUrn. Gobblin Kafka uses this approach.
            3. Some jobs (Avro to ORC) use a dedicated `WatermarkWorkUnit` to store a vector of watermarks for all partitions of a dataset. This reduces the number of dummy workunits when partitions are not modified.
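
      To make the first use case concrete, here is a minimal sketch of how a source could tell real retry attempts from no-op attempts under a base-2 policy. The class and method names (`BackoffSketch`, `isRetryAttempt`, `noopAttempts`) are hypothetical and are not part of Gobblin; the sketch assumes attempts are counted from 1 and that the first retry happens on attempt 2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Gobblin.
public class BackoffSketch {

    /** Returns true when the attempt number is 2, 4, 8, ... (a power of two, at least 2). */
    static boolean isRetryAttempt(int attempt) {
        return attempt >= 2 && (attempt & (attempt - 1)) == 0;
    }

    /** Lists the attempts in [2, maxAttempt] on which only a dummy workunit would be emitted. */
    static List<Integer> noopAttempts(int maxAttempt) {
        List<Integer> result = new ArrayList<>();
        for (int attempt = 2; attempt <= maxAttempt; attempt++) {
            if (!isRetryAttempt(attempt)) {
                result.add(attempt);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int attempt = 2; attempt <= 8; attempt++) {
            String kind = isRetryAttempt(attempt) ? "retry" : "noop";
            System.out.println("attempt " + attempt + " -> " + kind);
        }
    }
}
```

      Under these assumptions, most attempts between two real retries are no-op attempts, which is why the number of dummy workunits grows quickly for long-failing workunits.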
          Current Workarounds

      In each of the above use cases, the extractor has some logic to skip such dummy workunits.
      In each of the above use cases, Gobblin still launches a mapper for the dummy workunits.
      In the retry case, the `WorkUnit` is intentionally set to `FAILED` so that `AbstractSource.getPreviousWorkUnitStatesForRetry()` returns this workunit in the next run. This is a hack.

          Proposed Solution
            1. Add a new `WorkingState` called `SKIPPED` in `gobblin.configuration.WorkUnitState.WorkingState`.
            2. The source can create workunits with `ConfigurationKeys.WORK_UNIT_WORKING_STATE_KEY` set to `SKIPPED`. `AbstractJobLauncher` will skip creating `Task`s for these workunits. It will however continue to persist these workunits in the state store.
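
      The proposal can be sketched roughly as follows. This is a self-contained illustration, not Gobblin code: `SketchWorkUnit`, the reduced `WorkingState` enum, `selectForTasks`, and `persistAll` are stand-ins invented here, and the real `AbstractJobLauncher` and state store interfaces are not used. It only shows the intended split: every workunit is persisted, but tasks are created only for workunits that are not `SKIPPED`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these types are stand-ins, not Gobblin classes.
public class SkippedWorkUnitSketch {

    /** Reduced stand-in for the working states; only the two needed here. */
    enum WorkingState { PENDING, SKIPPED }

    /** Stand-in for a workunit that carries an id and a working state. */
    static final class SketchWorkUnit {
        final String id;
        final WorkingState state;

        SketchWorkUnit(String id, WorkingState state) {
            this.id = id;
            this.state = state;
        }
    }

    /** Workunits the launcher would create tasks for: everything that is not SKIPPED. */
    static List<SketchWorkUnit> selectForTasks(List<SketchWorkUnit> workUnits) {
        List<SketchWorkUnit> runnable = new ArrayList<>();
        for (SketchWorkUnit workUnit : workUnits) {
            if (workUnit.state != WorkingState.SKIPPED) {
                runnable.add(workUnit);
            }
        }
        return runnable;
    }

    /** Ids that would be written to the state store: all workunits, skipped or not. */
    static List<String> persistAll(List<SketchWorkUnit> workUnits) {
        List<String> persistedIds = new ArrayList<>();
        for (SketchWorkUnit workUnit : workUnits) {
            persistedIds.add(workUnit.id);
        }
        return persistedIds;
    }

    public static void main(String[] args) {
        List<SketchWorkUnit> workUnits = new ArrayList<>();
        workUnits.add(new SketchWorkUnit("dataset-a", WorkingState.PENDING));
        workUnits.add(new SketchWorkUnit("dataset-b", WorkingState.SKIPPED));

        System.out.println("tasks created: " + selectForTasks(workUnits).size());
        System.out.println("persisted: " + persistAll(workUnits));
    }
}
```

      With this split, the extractor no longer needs its own skip logic and no mapper has to be launched for a dummy workunit, assuming the launcher applies the filter before task creation.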

      @ibuenros and @chavdar please review.

      Github Url : https://github.com/linkedin/gobblin/issues/1314
      Github Reporter : pcadabam
      Github Created At : 2016-10-13T22:58:18Z
      Github Updated At : 2016-12-15T13:51:49Z

      Comments


      chavdar wrote on 2016-10-13T23:23:27Z : Why do we need to create a container and run a work unit that will do almost nothing? Why can't this be handled in the source itself?

      Github Url : https://github.com/linkedin/gobblin/issues/1314#issuecomment-253668272


      pcadabam wrote on 2016-10-13T23:43:47Z : Discussed offline. The approach seems to be fine.

      Github Url : https://github.com/linkedin/gobblin/issues/1314#issuecomment-253671516


      aditya1105 wrote on 2016-12-15T13:51:49Z : This is fixed with https://github.com/linkedin/gobblin/pull/1339

      Github Url : https://github.com/linkedin/gobblin/issues/1314#issuecomment-267331921

          People

            Unassigned Unassigned
            abti Abhishek Tiwari