Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Suppose the sizes of the job state and each task state are 0.9MB and 0.1MB respectively, and there are 1000 tasks.
There are two issues that cause high memory usage:
1. When a task runs, it merges the job state with the properties in the workunit (e.g., in MR mode, this is done in `MRJobLauncher.TaskRunner.map()`), so each task state is 1MB. When tasks complete, `TaskStateCollectorService` collects all task states into memory. That takes 1GB of memory, 900MB of which is redundant.
2. In `JobContext.commit`, the first step is to create the dataset states. Each dataset state also contains a copy of the job state (due to the `addAll` call in `JobState.newDatasetState()`). In the worst case there are as many dataset states as task states (i.e., 1000), so that takes another 900MB of memory, which is also redundant.
As a result the memory usage is 1.9GB, though theoretically we only need about 100MB.
As can be seen from [this heap dump](http://s11.postimg.org/kydflisj7/image.png), a huge amount of memory is consumed by states, and there's a lot of redundancy.
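The arithmetic above can be sketched as a small back-of-the-envelope model (the class and constant names below are illustrative, not part of the Gobblin codebase):

```java
// Hypothetical model of the duplication described above.
public class StateMemoryModel {

    static final int TASKS = 1000;
    static final double JOB_STATE_MB = 0.9;   // size of the job state
    static final double TASK_STATE_MB = 0.1;  // task-specific portion of each task state

    // Memory when every task state carries a full merged copy of the job state,
    // plus dataset states that each copy the job state again (worst case).
    static double mergedUsageMb() {
        double taskStates = TASKS * (JOB_STATE_MB + TASK_STATE_MB); // 1000 MB
        double datasetStates = TASKS * JOB_STATE_MB;                // 900 MB
        return taskStates + datasetStates;                          // 1900 MB
    }

    // Memory if the job state were shared: one copy plus the task-specific parts.
    static double sharedUsageMb() {
        return JOB_STATE_MB + TASKS * TASK_STATE_MB; // ~100.9 MB
    }

    public static void main(String[] args) {
        System.out.printf("merged: %.1f MB, shared: %.1f MB%n",
                mergedUsageMb(), sharedUsageMb());
    }
}
```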
One possible solution is to never keep more than one copy of the job state: either persist the job state and the task states separately in the state store, or merge the job state into each task state only at the last moment, right before it is persisted.
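A minimal sketch of the "merge at the last moment" idea could look like the following (the class and method names are hypothetical, not the actual Gobblin API): each task state stores only its own properties plus a reference to the shared job state, and the full merged view is materialized only when the state is about to be persisted.

```java
import java.util.Properties;

// Sketch under the assumption that reads can fall back to a shared job state,
// so only one copy of the job state is ever held in memory.
public class LazyTaskState {

    private final Properties jobState;                       // shared, single copy
    private final Properties taskProps = new Properties();   // task-specific only

    public LazyTaskState(Properties jobState) {
        this.jobState = jobState;
    }

    public void setProp(String key, String value) {
        taskProps.setProperty(key, value);
    }

    // Callers still see the merged view, without a per-task copy of the job state.
    public String getProp(String key) {
        String v = taskProps.getProperty(key);
        return v != null ? v : jobState.getProperty(key);
    }

    // Merge only when writing to the state store.
    public Properties materializeForPersist() {
        Properties merged = new Properties();
        merged.putAll(jobState);
        merged.putAll(taskProps); // task-specific values override job values
        return merged;
    }
}
```

With 1000 such task states, the job state exists once in memory instead of 1000 times, and the redundant copies appear only transiently, one at a time, during persistence.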
Github Url : https://github.com/linkedin/gobblin/issues/881
Github Reporter : zliu41
Github Created At : 2016-03-25T01:24:15Z
Github Updated At : 2017-01-12T04:50:38Z