Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-13842

Add per-aggregation step before repartitioning

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • streams

    Description

      Kafka Streams follows a continuous refinement model for aggregation. For this reason, we never implement a pre-aggregation step before data repartitioning, because it won't help much to reduce repartition cost (there is no natural boundary when a pre-aggregation is finished and when to emit it downstream for the actual aggregation roll-up).

      With https://issues.apache.org/jira/browse/KAFKA-13785 we introduce a per-aggregation "emit final" feature (different to suppress()) that changes the continuous refinement model and thus it seems to be a good optimization to add a pre-aggregation step if this new feature is used.

      We might want to give user control over inserting the pre-aggregation step because there is no free lunch... If we have X distinct keys, pre-aggregation implies that the upstream RocksDB store will need to store up to X rows to hold the pre-aggregate. Thus, given N input partitions, we need to hold N*X rows (upstream) plus X rows (in the final donwstream aggregation). – In contrast, a direct repartition step will only require to hold X rows downstream. It's a tradeoff between (much) higher disk usage vs network/Kafka traffic.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              mjsax Matthias J. Sax
              Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: