FLINK-34559

TVF Window Aggregations might get stuck


Details

    Description

      RecordsWindowBuffer flushes buffered records in the following cases:

      • watermark
      • checkpoint barrier
      • buffer overflow

       

      In two-phase aggregations, this creates the following problems:

      1) Local aggregation: the operator enters hard back-pressure because, on flush, it emits all buffered data downstream without checking network buffer availability.

      This alone already disrupts normal checkpointing and watermark progression.

       

      2) Global aggregation: 

      When the window is large enough and/or the watermark is lagging, a large amount of data is flushed to the state backend (and the state is updated) during the synchronous phase of the checkpoint.

       

      Together, these effects eventually cause checkpoint timeouts (10 minutes in our environment).

       

      Example query

      INSERT INTO `target_table`
      SELECT window_start, window_end, some, attributes,
             SUM(view_time) AS total_view_time,
             COUNT(*) AS num,
             LISTAGG(DISTINCT page_url) AS pages
      FROM TABLE(TUMBLE(TABLE source_table, DESCRIPTOR($rowtime), INTERVAL '1' HOUR))
      GROUP BY window_start, window_end, some, attributes;

      In our setup, the issue can be reproduced deterministically.

       

      As a quick fix, we might want to:

      1. limit the amount of data buffered in Global Aggregation nodes
      2. disable two-phase aggregations, i.e. Local Aggregations (we can try to limit buffering there too, but network buffer availability cannot be easily checked from the operator)
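
As a possible user-side workaround for item 2, the planner option `table.optimizer.agg-phase-strategy` can be set to `ONE_PHASE` before submitting the job. This is only a sketch: it assumes the option is honored for window TVF aggregations in the Flink version in use, and it trades the local pre-aggregation (and its buffering) for more data shuffled to the global aggregation.

```sql
-- Assumption: run in the SQL client (or via TableEnvironment config) before the INSERT.
-- ONE_PHASE asks the planner not to split the aggregation into local + global stages.
SET 'table.optimizer.agg-phase-strategy' = 'ONE_PHASE';

-- The example query from the description, unchanged.
INSERT INTO `target_table`
SELECT window_start, window_end, some, attributes,
       SUM(view_time) AS total_view_time,
       COUNT(*) AS num,
       LISTAGG(DISTINCT page_url) AS pages
FROM TABLE(TUMBLE(TABLE source_table, DESCRIPTOR($rowtime), INTERVAL '1' HOUR))
GROUP BY window_start, window_end, some, attributes;
```

This does not address item 1: the global aggregation still buffers records and flushes them in the synchronous checkpoint phase.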

            People

              roman Roman Khachatryan