Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
VertexManager is part of Vertex and is a user-facing API. Task recovery depends not only on Vertex but also on VertexManager. Currently, VertexManager may interact with Vertex throughout the whole lifecycle of the Vertex, which makes the recovery of Vertex/Task quite complicated. Recovering VertexManager itself is almost impossible: because it is a user-facing API, we have no control over its implementation.
Defining the completeness of VertexManager could help the recovery of Vertex. VertexManager is complete when it has fulfilled its responsibility: it won't interact with Vertex again, and Vertex won't use it again. If VertexManager is in the completed state, we don't need it in recovery.
The following are the methods through which VertexManager interacts with Vertex via VertexManagerPluginContext. We can classify these methods into 2 types. One type reconfigures the vertex, for example by changing parallelism or source edge managers. The other type schedules tasks. If VertexManager is in the completed state, none of these methods will be called again.
- setVertexParallelism
- reconfigureVertex
- vertexReconfigurationPlanned
- doneReconfiguringVertex
- scheduleVertexTasks
Initial idea for representing the completeness of VertexManager:
- If VertexImpl#vertexReconfigurationPlanned is not invoked, there is 1 condition for the completeness of VertexManager:
  - All the tasks are started (all TaskStartedEvents are seen; otherwise we can't guarantee that VertexManager will schedule tasks the same way as in the last AM attempt). That means VertexManager won't call scheduleTasks again.
- If VertexImpl#vertexReconfigurationPlanned is invoked, there are 2 conditions for the completeness of VertexManager:
  - VertexImpl#doneReconfiguringVertex is invoked.
  - All the tasks are started (all TaskStartedEvents are seen; otherwise we can't guarantee that VertexManager will schedule tasks the same way as in the last AM attempt). That means VertexManager won't call scheduleTasks again.
If VertexManager is in the completed state, we can continue the recovery of the vertex based on the recovery events. Otherwise, we recover the vertex from scratch.
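The conditions above could be sketched roughly as follows. This is only an illustration of the idea, not existing Tez code: the class VertexManagerCompletenessSketch, the RecoveryAction enum, and all parameter names are hypothetical, and it assumes the recovery log can tell us whether vertexReconfigurationPlanned and doneReconfiguringVertex were invoked and how many tasks have a TaskStartedEvent.

```java
/**
 * Hypothetical sketch (not part of Tez): decides whether a VertexManager
 * can be treated as "completed" when the AM recovers a vertex.
 */
final class VertexManagerCompletenessSketch {

    /** Hypothetical outcomes of the recovery decision for one vertex. */
    enum RecoveryAction { CONTINUE_FROM_RECOVERY_EVENTS, RECOVER_FROM_SCRATCH }

    private VertexManagerCompletenessSketch() {}

    /**
     * @param reconfigurationPlanned true if vertexReconfigurationPlanned was invoked
     * @param reconfigurationDone    true if doneReconfiguringVertex was invoked
     * @param totalTasks             number of tasks in the vertex
     * @param taskStartedEventsSeen  number of tasks with a TaskStartedEvent recorded
     */
    static boolean isComplete(boolean reconfigurationPlanned,
                              boolean reconfigurationDone,
                              int totalTasks,
                              int taskStartedEventsSeen) {
        // A planned reconfiguration must also have been finished.
        if (reconfigurationPlanned && !reconfigurationDone) {
            return false;
        }
        // Every task must have been started; otherwise the VertexManager
        // could still schedule tasks, possibly differently from the last attempt.
        return taskStartedEventsSeen == totalTasks;
    }

    /** Maps the completeness check to the recovery strategy described above. */
    static RecoveryAction decide(boolean reconfigurationPlanned,
                                 boolean reconfigurationDone,
                                 int totalTasks,
                                 int taskStartedEventsSeen) {
        return isComplete(reconfigurationPlanned, reconfigurationDone,
                          totalTasks, taskStartedEventsSeen)
                ? RecoveryAction.CONTINUE_FROM_RECOVERY_EVENTS
                : RecoveryAction.RECOVER_FROM_SCRATCH;
    }
}
```

For example, decide(true, false, 3, 3) would pick RECOVER_FROM_SCRATCH, because reconfiguration was planned but never finished even though all tasks were started.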
Things may change after TEZ-2103, which may kill tasks after they have started running. We may need to introduce a completion API for VertexManager.