Apache Tez / TEZ-1143

1-1 source split event should be handled in Vertex.RUNNING and Vertex.INITED states

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: None
    • Labels: None
    • Reviewed

    Description

      A one-to-one edge fails when the parallelism of the source vertex changes dynamically (through a ShuffleVertexManager). Here is the stack:

      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Vertex vertex_1400646157236_0012_1_03 parallelism set to 1 from 20
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000001
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000002
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000003
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000004
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000005
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000006
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000007
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000008
      2014-05-21 00:05:55,284 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000009
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000010
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000011
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000012
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000013
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000014
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000015
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000016
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000017
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000018
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Removing task: task_1400646157236_0012_1_03_000019
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Replacing edge manager for source:scope-41 destination: vertex_1400646157236_0012_1_03
      2014-05-21 00:05:55,285 INFO [AsyncDispatcher event handler] org.apache.tez.dag.history.HistoryEventHandler: [HISTORY][DAG:dag_1400646157236_0012_1][Event:VERTEX_PARALLELISM_UPDATED]: vertexId=vertex_1400646157236_0012_1_03, numTasks=1, vertexLocationHint=null, edgeManagersCount=1
      2014-05-21 00:05:55,286 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Vertex vertex_1400646157236_0012_1_02 completed., numCompletedVertices=3, numSuccessfulVertices=3, numFailedVertices=0, numKilledVertices=0, numVertices=7
      2014-05-21 00:05:55,287 ERROR [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Can't handle Invalid event V_ONE_TO_ONE_SOURCE_SPLIT on vertex scope-61 with vertexId vertex_1400646157236_0012_1_05 at current state RUNNING
      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: V_ONE_TO_ONE_SOURCE_SPLIT at RUNNING
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1263)
              at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:158)
              at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1716)
              at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1702)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
              at java.lang.Thread.run(Thread.java:695)
      

      Attached complete AM log. scope-42 is the source vertex and scope-61 is the destination vertex.

      The issue is that the code assumed the split event would always arrive before the vertex starts. This is not valid in all cases. For example, if the event arrives via two different paths in the DAG, the vertex can start after the first path sets the parallelism, and then the second path sends the event. Also, if the previous vertex is a shuffle/reduce vertex, its parallelism can change while it is running, which in turn changes the current vertex's parallelism while it is running.
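
      To illustrate the failure mode described above, here is a minimal, self-contained sketch of a vertex state machine that accepts the split event in both the INITED and RUNNING states. It is not the Tez implementation or the attached patch: the class OneToOneSplitSketch, its transition table, and the V_INIT, V_START and V_COMPLETED events are assumptions made for illustration. Only the V_ONE_TO_ONE_SOURCE_SPLIT event name and the INITED/RUNNING states come from this report.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch, not Tez code: a tiny table-driven vertex state machine.
final class OneToOneSplitSketch {
    enum State { NEW, INITED, RUNNING, SUCCEEDED }
    // Only V_ONE_TO_ONE_SOURCE_SPLIT is taken from the report; the rest are illustrative.
    enum Event { V_INIT, V_START, V_ONE_TO_ONE_SOURCE_SPLIT, V_COMPLETED }

    private static final Map<State, Map<Event, State>> TRANSITIONS =
            new EnumMap<State, Map<Event, State>>(State.class);

    static {
        add(State.NEW, Event.V_INIT, State.INITED);
        add(State.INITED, Event.V_START, State.RUNNING);
        // The split event is legal before the vertex starts...
        add(State.INITED, Event.V_ONE_TO_ONE_SOURCE_SPLIT, State.INITED);
        // ...and also after it has started; without this entry the machine
        // rejects the event at RUNNING, as in the stack trace above.
        add(State.RUNNING, Event.V_ONE_TO_ONE_SOURCE_SPLIT, State.RUNNING);
        add(State.RUNNING, Event.V_COMPLETED, State.SUCCEEDED);
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    private State state = State.NEW;
    private int parallelism;

    OneToOneSplitSketch(int initialParallelism) {
        this.parallelism = initialParallelism;
    }

    State state() { return state; }
    int parallelism() { return parallelism; }

    // Applies an event, or throws if the current state has no transition for it.
    void handle(Event event) {
        Map<Event, State> row = TRANSITIONS.get(state);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + state);
        }
        state = next;
    }

    // Handles a source split: validates the transition, then adopts the new parallelism.
    void handleSplit(int newParallelism) {
        handle(Event.V_ONE_TO_ONE_SOURCE_SPLIT);
        parallelism = newParallelism;
    }

    public static void main(String[] args) {
        OneToOneSplitSketch vertex = new OneToOneSplitSketch(20);
        vertex.handle(Event.V_INIT);
        vertex.handle(Event.V_START);
        vertex.handleSplit(1); // split arrives while the vertex is already RUNNING
        System.out.println(vertex.state() + " parallelism=" + vertex.parallelism());
    }
}
```

      In this sketch the split event leaves the state unchanged and only updates the parallelism. A real fix would also have to reconcile tasks that were already scheduled, which is outside the scope of this example.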

      Attachments

        1. syslog_dag_1400696568249_0001_1
          70 kB
          Daniel Dai
        2. TEZ-1143.1.patch
          7 kB
          Bikas Saha
        3. TEZ-1143.2.patch
          13 kB
          Bikas Saha
        4. TEZ-1143.3.patch
          14 kB
          Bikas Saha
        5. TEZ-1143.addendum.patch
          4 kB
          Bikas Saha

            People

              Assignee: Bikas Saha (bikassaha)
              Reporter: Daniel Dai (daijy)