Uploaded image for project: 'Apache Gobblin'
  1. Apache Gobblin
  2. GOBBLIN-118

State management after running workunits

    XMLWordPrintableJSON

Details

    Description

      Suppose the sizes of the job state and each task state is 0.9M and 0.1M respectively, and there are 1000 tasks.

      There are two issues that cause high memory usage:
      1. When a task runs, it merges the job state with the properties in the workunit (e.g., in MR mode, this is done in `MRJobLauncher.TaskRunner.map()`. So each task state is 1M. When tasks complete, `TaskStateCollectorService` collects all task states into memory. That takes 1G memory, 900M of which is redundant.
      2. In `JobContext.commit`, the first step is to create dataset states. Each dataset state also contains a copy of job state (due to the `addAll` call in `JobState.newDatasetState()`). In the worst case, there are as many dataset states as task states (i.e., 1000), so that takes another 900M of memory, which are also redundant information.

      As a result the memory usage is 1.9G, but theoretically we only need 100M.

      As can be seen from [this heap dump](http://s11.postimg.org/kydflisj7/image.png), a huge amount of memory is consumed by states, and there's a lot of redundancy.

      One possible solution is to never use more than one copy of job state, and either persist job state and task states separately in the state store, or only merge job state with each task state in the last moment before they are persisted.

      Github Url : https://github.com/linkedin/gobblin/issues/881
      Github Reporter : zliu41
      Github Created At : 2016-03-25T01:24:15Z
      Github Updated At : 2017-01-12T04:50:38Z

      Attachments

        Activity

          People

            hutran Hung Tran
            abti Abhishek Tiwari
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: