RATIS-2162

When closing leaderState, if the logAppender thread sends a snapshot, a deadlock may occur

Details

    • Type: Wish
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: Leader
    • Labels: None

    Description

      This issue is the root cause of RATIS-2161 (Grpc may spawn many threads).

      1. The old leader S receives a larger term number and converts to follower.
      2. LogAppender thread L does not receive the shutdown signal in time because a restart was triggered abnormally.
      3. S holds the ‘server’ lock and waits for L to shut down.
      4. L triggers snapshot sending and calls newInstallSnapshotRequests.
      5. In newInstallSnapshotRequests, L tries to acquire the ‘server’ lock.
       

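      This eventually leads to a deadlock: S holds the ‘server’ lock while waiting for L to shut down, and L waits for the same lock. gRPC cannot reclaim the thread in time, which then causes the problem described in RATIS-2161.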
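The lock-then-wait pattern described in the steps above can be reproduced in a few lines. This is a minimal sketch, not Ratis code: `ShutdownDeadlockSketch`, `serverLock`, and `appenderStopsWhileLockHeld` are hypothetical names, and the join is bounded by a timeout so that the demo terminates instead of hanging the way the real scenario does.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for the scenario; none of these names are Ratis classes. */
class ShutdownDeadlockSketch {

  /**
   * Returns true if the appender thread terminated while the caller still
   * held the lock, false if it was still blocked waiting for that lock.
   */
  static boolean appenderStopsWhileLockHeld(long timeoutMs) {
    final Object serverLock = new Object();            // stands in for the 'server' lock
    final CountDownLatch lockHeld = new CountDownLatch(1);

    // L: missed the shutdown signal and now needs the server lock (steps 4 and 5).
    Thread appender = new Thread(() -> {
      try {
        lockHeld.await();                              // ensure S takes the lock first
      } catch (InterruptedException e) {
        return;
      }
      synchronized (serverLock) {                      // blocks while S holds the lock
        // building the snapshot requests would happen here
      }
    }, "appender-L");
    appender.setDaemon(true);
    appender.start();

    try {
      boolean stopped;
      synchronized (serverLock) {                      // S: holds the lock (step 3)
        lockHeld.countDown();
        appender.join(timeoutMs);                      // S: waits for L to shut down
        stopped = !appender.isAlive();
      }
      appender.join();                                 // lock released, L can finish
      return stopped;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("appender stopped while lock was held: "
        + appenderStopsWhileLockHeld(200));
  }
}
```

With an unbounded `join()` inside the `synchronized` block, neither thread could make progress, which is the situation this issue reports.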

      Timeline of closing LeaderState:

        ----+----------------------------------------+---->
            |                                        |
        shutdown LeaderState                 stop LogAppender L

      Timeline of LogAppender L:

                      ----+--------------------------+---->
                          |                          |
                  restart logAppender     newInstallSnapshotRequests
       
       
      One possible fix is to check the Raft role each time the LogAppender is awakened, and to close the LogAppender if the server is no longer the leader.
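The suggested check can be sketched as an appender loop that tests the role at the top of every iteration, before doing any work that might need the server lock. This is an illustration only: `RoleCheckingAppenderSketch` and the `isLeader` supplier are hypothetical, and the sketch does not claim to match the patch that resolved this issue.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical appender loop; not the Ratis LogAppender implementation. */
class RoleCheckingAppenderSketch implements Runnable {
  private final BooleanSupplier isLeader;  // assumed hook that reports the current role
  private volatile boolean running = true;
  private int rounds = 0;

  RoleCheckingAppenderSketch(BooleanSupplier isLeader) {
    this.isLeader = isLeader;
  }

  @Override
  public void run() {
    while (running) {
      // Check the role on every wake-up, before any work that may need the server lock.
      if (!isLeader.getAsBoolean()) {
        running = false;                   // no longer the leader: close the appender
        break;
      }
      rounds++;                            // placeholder for appendEntries / snapshot work
      try {
        Thread.sleep(1);                   // stands in for waiting until awakened
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        running = false;
      }
    }
  }

  boolean isRunning() {
    return running;
  }

  int getRounds() {
    return rounds;
  }

  /** Runs a loop whose role check succeeds leaderChecks times, then reports a step-down. */
  static int roundsBeforeStepDown(int leaderChecks) {
    AtomicInteger remaining = new AtomicInteger(leaderChecks);
    RoleCheckingAppenderSketch appender =
        new RoleCheckingAppenderSketch(() -> remaining.getAndDecrement() > 0);
    appender.run();
    return appender.getRounds();
  }

  public static void main(String[] args) {
    System.out.println("rounds before step-down: " + roundsBeforeStepDown(3));
  }
}
```

The loop exits on its own once the role check fails, so the closing thread does not depend on the shutdown signal having been delivered.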

       

       
      In addition, LeaderStateImpl has another thread-safety issue in senderList.
      removeSenders, addSenders and stopAll may be called from multiple threads.
      For example, thread t1 creates a futures array of size 3 in stopAll, and then thread t2 calls removeSenders. This may cause an out-of-bounds access because futures.length is 3 but senders.size() < 3.
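The race described above, and one way to avoid it, can be sketched as follows. All names here (`SenderListSketch`, `Sender`, `stopAllRacy`, `stopAllWithSnapshot`) are hypothetical, and the `concurrentChange` hook only simulates thread t2 running between the size check and the indexed reads. The snapshot approach is an assumption about a reasonable fix, not necessarily what the attached patch does.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sender list; not the Ratis LeaderStateImpl code. */
class SenderListSketch {
  interface Sender {
    CompletableFuture<Void> stopAsync();
  }

  private final List<Sender> senders = new CopyOnWriteArrayList<>();

  /** Racy: size() and get(i) may observe different states of the list. */
  CompletableFuture<Void> stopAllRacy(Runnable concurrentChange) {
    final CompletableFuture<?>[] futures = new CompletableFuture<?>[senders.size()];
    concurrentChange.run();                       // simulates t2 calling removeSenders
    for (int i = 0; i < futures.length; i++) {
      futures[i] = senders.get(i).stopAsync();    // may throw IndexOutOfBoundsException
    }
    return CompletableFuture.allOf(futures);
  }

  /** Safer: take one snapshot of the list and only use the snapshot afterwards. */
  CompletableFuture<Void> stopAllWithSnapshot(Runnable concurrentChange) {
    final Sender[] snapshot = senders.toArray(new Sender[0]);
    concurrentChange.run();                       // later changes do not affect the snapshot
    final CompletableFuture<?>[] futures = new CompletableFuture<?>[snapshot.length];
    for (int i = 0; i < snapshot.length; i++) {
      futures[i] = snapshot[i].stopAsync();
    }
    return CompletableFuture.allOf(futures);
  }

  private static SenderListSketch newListOfThree(AtomicInteger stopped) {
    SenderListSketch list = new SenderListSketch();
    for (int i = 0; i < 3; i++) {
      list.senders.add(() -> {
        stopped.incrementAndGet();
        return CompletableFuture.completedFuture(null);
      });
    }
    return list;
  }

  /** Returns true if the racy version fails when a sender is removed midway. */
  static boolean racyVersionThrows() {
    SenderListSketch list = newListOfThree(new AtomicInteger());
    try {
      list.stopAllRacy(() -> list.senders.remove(0));
      return false;
    } catch (IndexOutOfBoundsException e) {
      return true;
    }
  }

  /** Returns how many senders the snapshot version stops under the same interleaving. */
  static int snapshotVersionStops() {
    AtomicInteger stopped = new AtomicInteger();
    SenderListSketch list = newListOfThree(stopped);
    list.stopAllWithSnapshot(() -> list.senders.remove(0)).join();
    return stopped.get();
  }

  public static void main(String[] args) {
    System.out.println("racy version throws: " + racyVersionThrows());
    System.out.println("snapshot version stopped: " + snapshotVersionStops());
  }
}
```

Note that the snapshot version may stop a sender that was removed concurrently; whether that is acceptable depends on whether stopping a sender twice is harmless, which this sketch assumes.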

      Attachments

        1. 1154_review.patch
          11 kB
          Tsz-wo Sze
        2. image-2024-09-24-10-43-34-812.png
          109 kB
          yuuka
        3. image-2024-09-24-10-41-20-140.png
          59 kB
          yuuka

            People

              Assignee: tohsakarin__ yuuka
              Reporter: tohsakarin__ yuuka
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m