Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
-
Reviewed
Description
One-one edge fail when the parallelism of source vertex changes dynamically (through a ShuffleVertexManager). Here is the stack:
2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Vertex vertex_1400646157236_0012_1_03 parallelism set to 1 from 202014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_0000012014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_0000022014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_0000032014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_0000042014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_0000052014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000006 2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_0000072014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000008 2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000009 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000010 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000011 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000012 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000013 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000014 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000015 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000016 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000017 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000018 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000019 2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Replacing edge manager for source:scope-41 destination: vertex_1400646157236_0012_1_032014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.history.HistoryEventHandler: [HISTORY][DAG:dag_1400646157236_0012_1][Event:VERTEX_PARALLELISM_UPDATED]: vertexId=vertex_1400646157236_0012_1_03, numTasks=1, vertexLocationHint=null, edgeManagersCount=12014-05-21 00:05:55,286 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Vertex vertex_1400646157236_0012_1_02 completed., numCompletedVertices=3, numSuccessfulVertices=3, numFailedVertices=0, numKilledVertices=0, numVertices=72014-05-21 00:05:55,287 ERROR [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Can't handle Invalid event V_ONE_TO_ONE_SOURCE_SPLIT on vertex scope-61 with vertexId vertex_1400646157236_0012_1_05 at current state RUNNINGorg.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: V_ONE_TO_ONE_SOURCE_SPLIT at RUNNING at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1263) at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:158) at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1716) at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1702) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81) at java.lang.Thread.run(Thread.java:695)
Attached complete AM log. scope-42 is the source vertex and scope-61 is the destination vertex.
The issue is that the code assumed that the split event will come before the vertex starts. This may not be valid in all cases. E.g. if the event comes from 2 different paths in the DAG then the vertex can start after 1 path sets the parallelism and then the second path sends the event. Also if the previous vertex was a shuffle/reduce then its parallelism can change while its running, resulting in changing the current vertex parallelism while its running.
Attachments
Attachments
Issue Links
- blocks
-
PIG-3846 Implement automatic reducer parallelism
- Closed