Uploaded image for project: 'Samza'
  1. Samza
  2. SAMZA-1613

Stream-table join: Intermediate streams do not inherit bootstrap stream semantic

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      There are two issues with repartitioning a bootstrap stream: 

      • Samza does not propagate the bootstrap semantics to the intermediate stream. 
      • For a bootstrap stream, Samza gets the newest offset for that stream before consumption and considers the bootstrap to be complete once the message with that offset is consumed. This is clearly a problem for intermediate streams as there might not be any messages at the beginning of the job. 

      Even though the first stage abides by the bootstrap semantics and does not consume from non-bootstrap streams until all the bootstrap stream partitions owned by that container are re-partitioned, there will be scenarios where one container finishes bootstrap faster than others and also the rate of consumption from the intermediate stream in the second stage might be slower. All these would result in the job breaking the bootstrap semantics across the two stages. Please note that this issue is not just specific to repartitioning but could happen with any intermediate streams. 

      To support this, we will have to solve both the afore-mentioned issues: 

      • Propagate bootstrap semantics to the intermediate streams if the repartitioned stream is bootstrap stream. 
      • Either add end of bootstrap control message (note that all the containers have to co-ordinate here) or wait for event time support in Samza. 

      This is clearly a problem for Stream-Table join. Consider the scenario where the load on the stream is much higher than the load on the stream marked as table. Consequently, the number of partitions in the stream will be higher than that of the table. To join these two, we will have to increase the number of partitions in the table by repartitioning, which is where we end up in the premature join of stream with the table, before seeding the table. 

      Attachments

        Issue Links

          Activity

            People

              ahabdulhamid Ahmed Abdul Hamid
              atoomula Aditya Toomula
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: