CASSANDRA-19534

Unbounded queues in native transport requests lead to node instability

    Description

      When a node is under pressure, hundreds of thousands of requests can accumulate in the native transport queue, and requests appear to take far longer to time out than the configured timeout.  We should shed load much more aggressively and use a bounded queue for incoming work.  This is especially evident when we combine a resource-intensive workload with a smaller one:

      Running 5.0 HEAD on a single node as of today:

      # populate only
      easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=100000 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
      
      # workload 1 - larger reads
      easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=100000 --workload.select=partition --rate 200 -d 1d
      
      # second workload - small reads
      easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h

      The requests also don't appear to time out at the configured server-side timeout:

       

                       Writes                                  Reads                                  Deletes                       Errors
        Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
       950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
       952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
       953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
       956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
       959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11
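To put the table in perspective, the 1-minute rates in its first row can be combined into an error fraction. This is only a rough sketch: it assumes every request from this client either completes as a read or write or is counted as an error, which the source does not state. The variable names are illustrative.

```python
# Figures copied from the first row of the table above (1-minute rates).
writes_per_s = 634.77
reads_per_s = 426.02
errors_per_s = 18980.45

# Assumption: completed and failed requests together make up all attempts.
completed_per_s = writes_per_s + reads_per_s
attempted_per_s = completed_per_s + errors_per_s
error_fraction = errors_per_s / attempted_per_s

print(f"completed: {completed_per_s:.2f} req/s")
print(f"attempted: {attempted_per_s:.2f} req/s")
print(f"error fraction: {error_fraction:.1%}")
```

Under that assumption, roughly 95% of attempted requests end in an error, and the attempted rate is close to the `--rate 20k` requested for the KeyValue workload.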

       

      After stopping the load test altogether, it took nearly a minute before the requests were no longer queued.
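The bounded queue and load shedding proposed above can be illustrated with a minimal sketch. This is not Cassandra's implementation or the patch for this ticket; the class `BoundedRequestQueue`, its methods, and its parameters are hypothetical names chosen for illustration, and the sketch is single-threaded. It shows two ideas: rejecting new work once the queue is full, and dropping requests whose deadline passed while they were queued instead of processing them.

```python
import time
from collections import deque


class BoundedRequestQueue:
    """Hypothetical bounded FIFO that sheds load instead of growing without limit."""

    def __init__(self, capacity, timeout_s, clock=time.monotonic):
        self.capacity = capacity
        self.timeout_s = timeout_s
        self.clock = clock
        self.items = deque()  # (deadline, request) pairs, oldest first
        self.rejected = 0     # requests shed because the queue was full
        self.expired = 0      # requests dropped because they timed out while queued

    def offer(self, request):
        """Enqueue a request; return False (shed it) when the queue is full."""
        if len(self.items) >= self.capacity:
            self.rejected += 1
            return False
        self.items.append((self.clock() + self.timeout_s, request))
        return True

    def poll(self):
        """Return the next request still within its deadline, or None."""
        now = self.clock()
        while self.items:
            deadline, request = self.items.popleft()
            if deadline > now:
                return request
            self.expired += 1  # timed out while queued; skip without processing
        return None


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
q = BoundedRequestQueue(capacity=2, timeout_s=1.0, clock=lambda: now[0])
accepted = [q.offer(r) for r in ("a", "b", "c")]  # "c" is shed: queue is full
now[0] = 0.5
first = q.poll()   # "a" is still within its 1.0 s deadline
now[0] = 2.0
second = q.poll()  # "b" expired while queued, so nothing is returned
```

With a bound in place, the time needed to drain the queue after load stops is limited by the capacity rather than by how long the overload lasted.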

      Attachments

        1. image-2024-08-08-14-25-12-915.png (246 kB, Gaurav Agarwal)
        2. image-2024-08-07-11-37-58-417.png (40 kB, Gaurav Agarwal)
        3. ci_summary-4.1.html (34 kB, Alex Petrov)
        4. ci_summary-trunk.html (108 kB, Alex Petrov)
        5. ci_summary-5.0.html (29 kB, Alex Petrov)
        6. image-2024-05-03-16-08-10-101.png (72 kB, Jon Haddad)
        7. screenshot-9.png (48 kB, Jon Haddad)
        8. screenshot-8.png (44 kB, Jon Haddad)
        9. screenshot-7.png (46 kB, Jon Haddad)
        10. screenshot-6.png (48 kB, Jon Haddad)
        11. screenshot-5.png (47 kB, Jon Haddad)
        12. screenshot-4.png (49 kB, Jon Haddad)
        13. screenshot-3.png (48 kB, Jon Haddad)
        14. screenshot-2.png (51 kB, Jon Haddad)
        15. screenshot-1.png (52 kB, Jon Haddad)
        16. ci_summary.html (55 kB, Alex Petrov)
        17. Scenario 2 - QUEUE + Backpressure.jpg (1.97 MB, Alex Petrov)
        18. Scenario 2 - QUEUE.jpg (1.80 MB, Alex Petrov)
        19. Scenario 2 - Stock.jpg (844 kB, Alex Petrov)
        20. Scenario 1 - Stock.jpg (601 kB, Alex Petrov)
        21. Scenario 1 - QUEUE + Backpressure.jpg (584 kB, Alex Petrov)
        22. Scenario 1 - QUEUE.jpg (499 kB, Alex Petrov)

            People

              ifesdjeen Alex Petrov
              rustyrazorblade Jon Haddad
              Alex Petrov
              Caleb Rackliffe
              Votes: 0
              Watchers: 13

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10h 10m