Qpid / QPID-5007

Qpid HA cluster does not support failback in an ordered domain.

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.22
    • Fix Version/s: None
    • Component/s: C++ Clustering
    • Labels: None

    Description

      rgmanager has the notion of an ordered failover domain, in which it tries to start services on the highest-priority node in the domain.
      (see https://fedorahosted.org/cluster/wiki/FailoverDomains)

      The problem arises like this:

      • Start a 2-node cluster with an ordered domain.
      • Create a queue and put enough messages on it that catchup takes longer than the time to restart node1.
      • Kill node1; rgmanager relocates the qpidd-primary service to node2.
      • Immediately restart node1.
      • rgmanager wants to relocate the service back to node1, so it:
        • kills the primary on node2 as the first step of relocation
        • attempts to restart the primary on node1, which fails because
          node1 is still in catchup and there is no primary to catch up
          from.
      • At this point we get into an infinite loop of failed attempts to
        restart the primary.

      The workaround is to set the nofailback option on the domain.
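
      As a rough illustration of the workaround, the fragment below sketches how an ordered failover domain with nofailback set might look in cluster.conf. The domain name "qpidd-domain" is hypothetical, the node names node1 and node2 are taken from the steps above, and the priorities assume that node1 is the preferred node (in rgmanager a lower number means higher priority). Treat it as a sketch under those assumptions, not a tested configuration.

      ```xml
      <!-- Sketch only: the domain name is hypothetical; adapt to your cluster.conf -->
      <failoverdomains>
        <!-- ordered="1": prefer the node with the lowest priority number -->
        <!-- nofailback="1": do not relocate the service back when a preferred node rejoins -->
        <failoverdomain name="qpidd-domain" ordered="1" nofailback="1">
          <failoverdomainnode name="node1" priority="1"/>
          <failoverdomainnode name="node2" priority="2"/>
        </failoverdomain>
      </failoverdomains>
      ```

      With nofailback set, the primary should stay on node2 after node1 restarts, so node1 can finish catching up as a backup instead of being promoted while there is no primary to catch up from.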

      See also: https://bugzilla.redhat.com/show_bug.cgi?id=970657

          People

            Assignee: Unassigned
            Reporter: Alan Conway (aconway)