Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
-
-
- Problem
-
Need a Noop `WorkUnit` in Gobblin. A Noop `WorkUnit` is one that does not need a `Task`. It can be thought of as a pass-through workunit that exists only because the source wants to persist it in the Gobblin state store.
-
-
- Use Cases
1. If a job has an exponential retry policy (base 2) for workunits, a workunit is retried only on attempts 2, 4, 8, and so on. The rest are Noop attempts. For these attempts the source still needs to create a dummy workunit and update the backoff value so that the `WorkUnit` is persisted in the state store.
2. When a dataset is blacklisted, the source has to create a dummy workunit with the previous high watermark so that Gobblin persists the high watermark in the state store. Otherwise the high watermark is lost. This can be solved using datasetUrns, where Gobblin automatically looks up the last available previous high watermark, but not all jobs use datasetUrn. Gobblin Kafka uses this approach.
3. Some jobs (Avro to ORC) use a dedicated `WatermarkWorkUnit` to store a vector of watermarks for all partitions of a dataset. This is to reduce the number of dummy workunits when partitions are not modified.
- Current Workarounds
-
In each of the above use cases, the extractor has some logic to skip such dummy workunits.
In each of the above use cases, Gobblin still launches a mapper for the dummy workunits.
In the retry case, the `WorkUnit` is intentionally set to `FAILED` so that `AbstractSource.getPreviousWorkUnitStatesForRetry()` returns this workunit in the next run. This seems hacky.
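For illustration, the base-2 retry schedule from use case 1 could be checked roughly as below. This is a minimal sketch, not Gobblin code: the class `BackoffSketch` and the method `isRetryAttempt` are hypothetical names, and it assumes attempts are counted from 1.

```java
// Hypothetical helper, not part of Gobblin: decides whether a given attempt
// is a real retry under a base-2 exponential backoff policy.
final class BackoffSketch {

    private BackoffSketch() {}

    // Returns true only for attempts 2, 4, 8, ... (powers of two >= 2).
    // Every other attempt would be a Noop attempt that only carries state forward.
    static boolean isRetryAttempt(int attempt) {
        // bitCount == 1 means exactly one bit is set, i.e. a power of two.
        return attempt >= 2 && Integer.bitCount(attempt) == 1;
    }
}
```

Under this sketch, attempts 3, 5, 6 and 7 are Noop attempts, which is where the source currently has to emit a dummy workunit.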
-
-
- Proposed Solution
1. Add a new `WorkingState` called `SKIPPED` in `gobblin.configuration.WorkUnitState.WorkingState`.
2. The source can create workunits with `ConfigurationKeys.WORK_UNIT_WORKING_STATE_KEY` set to `SKIPPED`. `AbstractJobLauncher` will skip creating `Task`s for these workunits. It will however continue to persist these workunits in the state store.
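The launcher-side behaviour proposed above might look roughly like the following. This is a self-contained sketch under stated assumptions: `SimpleWorkUnit`, the reduced `WorkingState` enum and `workUnitsNeedingTasks` are hypothetical stand-ins and do not reflect the real `WorkUnit`, `WorkUnitState.WorkingState` or `AbstractJobLauncher` code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only, not the Gobblin classes.
final class SkipSketch {

    private SkipSketch() {}

    // Reduced stand-in enum; SKIPPED is the state proposed in this issue.
    enum WorkingState { PENDING, SKIPPED }

    static final class SimpleWorkUnit {
        final String id;
        final WorkingState state;

        SimpleWorkUnit(String id, WorkingState state) {
            this.id = id;
            this.state = state;
        }
    }

    // Selects the workunits for which a Task should be created.
    // SKIPPED workunits are left out here, but the caller would still
    // persist the full input list to the state store.
    static List<SimpleWorkUnit> workUnitsNeedingTasks(List<SimpleWorkUnit> all) {
        List<SimpleWorkUnit> runnable = new ArrayList<>();
        for (SimpleWorkUnit workUnit : all) {
            if (workUnit.state != WorkingState.SKIPPED) {
                runnable.add(workUnit);
            }
        }
        return runnable;
    }
}
```

The point of the design is that task creation and state persistence are decoupled: the filtered list drives task creation, while the unfiltered list is what gets persisted.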
-
@ibuenros and @chavdar please review.
Github Url : https://github.com/linkedin/gobblin/issues/1314
Github Reporter : pcadabam
Github Created At : 2016-10-13T22:58:18Z
Github Updated At : 2016-12-15T13:51:49Z
Comments
chavdar wrote on 2016-10-13T23:23:27Z : Why do we need to create a container and run a work unit that will do almost nothing? Why can't this be handled in the source itself?
Github Url : https://github.com/linkedin/gobblin/issues/1314#issuecomment-253668272
pcadabam wrote on 2016-10-13T23:43:47Z : Discussed offline. The approach seems to be fine.
Github Url : https://github.com/linkedin/gobblin/issues/1314#issuecomment-253671516
aditya1105 wrote on 2016-12-15T13:51:49Z : This is fixed with https://github.com/linkedin/gobblin/pull/1339
Github Url : https://github.com/linkedin/gobblin/issues/1314#issuecomment-267331921