  Qpid / QPID-5719

HA becomes unresponsive once any of the brokers are SIGSTOPed

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.28
    • Fix Version/s: 0.29
    • Component/s: C++ Clustering
    • Labels: None

    Description

      See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638

      Description of problem:

      qpid HA becomes unresponsive once any of the brokers are SIGSTOPed.

      There are three different cases:
      a] stopped ALL brokers
      b] stopped the primary
      c] stopped a backup
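
To reproduce the three cases, a small helper along these lines could be used. It is only a sketch: the function names are hypothetical, and the caller is assumed to already know the PIDs of the broker processes on each node (the report does not say how they were obtained).

```python
import os
import signal


def freeze(pid: int) -> None:
    """Suspend a process with SIGSTOP; it stays alive but stops executing."""
    os.kill(pid, signal.SIGSTOP)


def thaw(pid: int) -> None:
    """Resume a process that was previously suspended with SIGSTOP."""
    os.kill(pid, signal.SIGCONT)


def freeze_all(pids) -> None:
    """Suspend every PID given (case a when passed all broker PIDs)."""
    for pid in pids:
        freeze(pid)
```

Cases b and c would call freeze() with the PID of the primary or of one backup respectively. Unlike kill -9, the stopped process does not exit, so anything that only checks whether the process exists still sees it as present.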

      In each of the cases listed above, the following observations were made:

      a-c] RHCS clustat shows no problem and reports everything as OK
      a-c] qpid-ha (status --all) hangs
      a,b,c*] all other clients are blocked indefinitely
      a-b] blocked right from the beginning
      c*] blocked at the end; the client is able to recover after a minute or so,
      due to the connection timeout
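
Because qpid-ha (status --all) hangs rather than failing, a test script that polls it needs its own deadline. A minimal sketch, assuming a hypothetical helper name and an arbitrary 10 second limit, neither of which comes from the report:

```python
import subprocess


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return (finished, stdout).

    finished is False when the command did not complete within timeout_s.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no
        # hung command is left behind.
        return False, ""
    return True, result.stdout


# Example call (assumes qpid-ha is on PATH; the limit is a guess):
# finished, out = run_with_deadline(["qpid-ha", "status", "--all"], 10)
```

A False result here only says the status query did not return in time; it does not identify which broker is stopped.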

      In fact, this defect also shows that qpid-ha can be out of sync with clustat, as tracked by the BZ.

      The expectations are:

      • a] quorum is lost and HA goes down (same as kill -9 on all nodes);
        no clients are able to communicate
      • b] a new primary is promoted; there has to be a mechanism to get rid of the stopped process;
        clients should be able to communicate after recovery
      • c] the unresponsive backup should get restarted;
        clients should be able to communicate once the backup has been detected as unresponsive
      • Generally, better integration between the Qpid HA environment and RHCS is needed,
        i.e. SIGSTOP detection
      • A heartbeat between the primary and the backups is probably needed
      Attachments

        1. ha-heartbeat.diff
          14 kB
          Alan Conway

          People

            Assignee: aconway Alan Conway
            Reporter: aconway Alan Conway
            Votes: 0
            Watchers: 3
