[Pacemaker] Cluster Refuses to Stop/Shutdown
Remi Broemeling
remi at nexopia.com
Thu Sep 24 23:27:54 UTC 2009
Ok, thanks for the note, Steven. I've filed the bug; it is #525589.
Steven Dake wrote:
> Remi,
>
> Likely a defect. We will have to look into it. Please file a bug as
> per the instructions on the corosync wiki at www.corosync.org.
>
> On Thu, 2009-09-24 at 16:47 -0600, Remi Broemeling wrote:
>
>> I've spent all day working on this, even going so far as to completely
>> build my own set of packages from the Debian-available ones (which
>> appear to be different from the Ubuntu-available ones). It didn't
>> have any effect on the issue at all: the cluster still freaks out and
>> ends up in a split-brain state after a single SIGQUIT.
>>
>> The Debian packages that also demonstrate this behavior were the
>> following versions:
>> cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
>> corosync_1.0.0-5~bpo50+1_i386.deb
>> libcorosync4_1.0.0-5~bpo50+1_i386.deb
>> libopenais3_1.0.0-4~bpo50+1_i386.deb
>> openais_1.0.0-4~bpo50+1_i386.deb
>> pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb
>>
>> These packages were re-built (under Ubuntu Hardy Heron LTS) from the
>> *.diff.gz, *.dsc, and *.orig.tar.gz files available at
>> http://people.debian.org/~madkiss/ha-corosync, and as I said, the
>> symptoms remain exactly the same, both under the configuration that I
>> list below and under the sample configuration that came with these
>> packages. I also attempted the same with a single IP address resource
>> associated with the cluster, just to be sure it wasn't an edge case
>> for a cluster with no resources, but again that had no effect.
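>>
>> For reference, a rebuild of this kind can be done with the standard
>> Debian source tools, e.g. for the corosync package (a sketch; the
>> exact invocations I used may have differed slightly):
>>
>>   # with the .dsc, .diff.gz, and .orig.tar.gz in the current directory
>>   dpkg-source -x corosync_1.0.0-5~bpo50+1.dsc
>>   cd corosync-1.0.0
>>   # build unsigned binary packages
>>   dpkg-buildpackage -rfakeroot -us -uc -b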
>>
>> Basically, I'm still exactly where I was yesterday morning at about
>> 09:00.
>>
>> Remi Broemeling wrote:
>>
>>> I posted this to the OpenAIS mailing list
>>> (openais at lists.linux-foundation.org) yesterday, but I haven't
>>> received a response, and upon further reflection I think I chose the
>>> wrong list to post to. That list seems to be far less about user
>>> support and far more about developer communication, so I'm re-trying
>>> here, as the archives show this list to be somewhat more
>>> user-focused.
>>>
>>> The problem is that corosync refuses to shut down in response to a
>>> QUIT signal. Given the following cluster (output of crm_mon):
>>>
>>> ============
>>> Last updated: Wed Sep 23 15:56:24 2009
>>> Stack: openais
>>> Current DC: boot1 - partition with quorum
>>> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
>>> 2 Nodes configured, 2 expected votes
>>> 0 Resources configured.
>>> ============
>>>
>>> Online: [ boot1 boot2 ]
>>>
>>> If I go onto the host 'boot2' and issue the command "killall -QUIT
>>> corosync", the anticipated result would be that boot2 would go
>>> offline (out of the cluster) and all of the cluster processes
>>> (corosync/stonithd/cib/lrmd/attrd/pengine/crmd) would shut down.
>>> However, this is not occurring, and I don't really have any idea
>>> why. After logging into boot2 and issuing the command "killall
>>> -QUIT corosync", the result is a split-brain:
>>>
>>> From boot1's viewpoint:
>>> ============
>>> Last updated: Wed Sep 23 15:58:27 2009
>>> Stack: openais
>>> Current DC: boot1 - partition WITHOUT quorum
>>> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
>>> 2 Nodes configured, 2 expected votes
>>> 0 Resources configured.
>>> ============
>>>
>>> Online: [ boot1 ]
>>> OFFLINE: [ boot2 ]
>>>
>>> From boot2's viewpoint:
>>> ============
>>> Last updated: Wed Sep 23 15:58:35 2009
>>> Stack: openais
>>> Current DC: boot1 - partition with quorum
>>> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
>>> 2 Nodes configured, 2 expected votes
>>> 0 Resources configured.
>>> ============
>>>
>>> Online: [ boot1 boot2 ]
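>>>
>>> As a quick check of which cluster daemons are still alive on boot2
>>> after the first signal, something like this can be used (a sketch,
>>> using the daemon names listed above):
>>>
>>>   # on boot2, after the first QUIT signal has been sent
>>>   ps -e | egrep 'corosync|stonithd|cib|lrmd|attrd|pengine|crmd'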
>>>
>>> At this point the status quo holds until ANOTHER QUIT signal is
>>> sent to corosync (i.e., the command "killall -QUIT corosync" is
>>> executed on boot2 again). Then boot2 shuts down properly and
>>> everything appears to be kosher. Basically, what I expect to happen
>>> after a single QUIT signal instead takes two QUIT signals to occur,
>>> and that summarizes my question: why does it take two QUIT signals
>>> to force corosync to actually shut down? Is that the desired
>>> behavior? From everything I have read online it seems very strange,
>>> and it makes me think that I have a problem in my configuration(s),
>>> but I've no idea what that would be, even after playing with things
>>> and investigating for the day.
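>>>
>>> To put the reproduction into commands (a sketch, matching the
>>> description above):
>>>
>>>   # on boot2: first QUIT signal -- corosync stays up, cluster splits
>>>   killall -QUIT corosync
>>>   crm_mon -1    # boot2 still reports: Online: [ boot1 boot2 ]
>>>
>>>   # on boot2: second QUIT signal -- only now does corosync shut down
>>>   killall -QUIT corosync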
>>>
>>> I would be very grateful for any guidance that could be provided, as
>>> at the moment I seem to be at an impasse.
>>>
>>> Log files, with debugging set to 'on', can be found at the following
>>> pastebin locations:
>>> After the first QUIT signal issued on boot2:
>>>   boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
>>>   boot2:/var/log/syslog: http://pastebin.com/d26fdfee
>>> After the second QUIT signal issued on boot2:
>>>   boot1:/var/log/syslog: http://pastebin.com/m755fb989
>>>   boot2:/var/log/syslog: http://pastebin.com/m22dcef45
>>>
>>> OS, Software Packages, and Versions:
>>> * two nodes, each running Ubuntu Hardy Heron LTS
>>> * ubuntu-ha packages, as downloaded from
>>>   http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
>>>   * pacemaker-openais package version 1.0.5+hg20090813-0ubuntu2~hardy1
>>>   * openais package version 1.0.0-3ubuntu1~hardy1
>>>   * corosync package version 1.0.0-4ubuntu1~hardy2
>>>   * heartbeat-common package version 2.99.2+sles11r9-5ubuntu1~hardy1
>>>
>>> Network Setup:
>>> * boot1
>>>   * eth0 is 192.168.10.192
>>>   * eth1 is 172.16.1.1
>>> * boot2
>>>   * eth0 is 192.168.10.193
>>>   * eth1 is 172.16.1.2
>>> * boot1:eth0 and boot2:eth0 both connect to the same switch.
>>> * boot1:eth1 and boot2:eth1 are connected directly to each other
>>>   via a cross-over cable.
>>> * no firewalls are involved, and tcpdump shows the multicast and
>>>   UDP traffic flowing correctly over these links (see the tcpdump
>>>   sketch after this list).
>>> * I attempted a broadcast (rather than multicast) configuration,
>>>   to see if that would fix the problem. It did not.
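>>>
>>> The tcpdump check was along these lines (a sketch; the interfaces
>>> and ports correspond to the corosync.conf below):
>>>
>>>   # ring 0: multicast on the cross-over link (eth1)
>>>   tcpdump -ni eth1 udp port 5505
>>>   # ring 1: multicast on the switched link (eth0)
>>>   tcpdump -ni eth0 udp port 6606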
>>>
>>> `crm configure show` output:
>>> node boot1
>>> node boot2
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
>>>         cluster-infrastructure="openais" \
>>>         expected-quorum-votes="2" \
>>>         stonith-enabled="false" \
>>>         no-quorum-policy="ignore"
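>>>
>>> (For reference, properties like these can be set from the crm
>>> shell, e.g.:)
>>>
>>>   crm configure property stonith-enabled=false
>>>   crm configure property no-quorum-policy=ignore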
>>>
>>> Contents of /etc/corosync/corosync.conf:
>>> # Please read the corosync.conf.5 manual page
>>> compatibility: whitetank
>>>
>>> totem {
>>>         clear_node_high_bit: yes
>>>         version: 2
>>>         secauth: on
>>>         threads: 1
>>>         heartbeat_failures_allowed: 3
>>>         interface {
>>>                 ringnumber: 0
>>>                 bindnetaddr: 172.16.1.0
>>>                 mcastaddr: 239.42.0.1
>>>                 mcastport: 5505
>>>         }
>>>         interface {
>>>                 ringnumber: 1
>>>                 bindnetaddr: 192.168.10.0
>>>                 mcastaddr: 239.42.0.2
>>>                 mcastport: 6606
>>>         }
>>>         rrp_mode: passive
>>> }
>>>
>>> amf {
>>>         mode: disabled
>>> }
>>>
>>> service {
>>>         name: pacemaker
>>>         ver: 0
>>> }
>>>
>>> aisexec {
>>>         user: root
>>>         group: root
>>> }
>>>
>>> logging {
>>>         debug: on
>>>         fileline: off
>>>         function_name: off
>>>         to_logfile: no
>>>         to_stderr: no
>>>         to_syslog: yes
>>>         timestamp: on
>>>         logger_subsys {
>>>                 subsys: AMF
>>>                 debug: off
>>>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>>>         }
>>> }
>>>
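>>> For completeness, the health of both redundant rings can be checked
>>> on each node with corosync's bundled tool (a sketch; corosync-cfgtool
>>> ships with the corosync package):
>>>
>>>   # print the status of ring 0 and ring 1 on the local node
>>>   corosync-cfgtool -s
>>>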
--
Remi Broemeling
Sr System Administrator
Nexopia.com Inc.
direct: 780 444 1250 ext 435
email: remi at nexopia.com
fax: 780 487 0376
www.nexopia.com
ICMP: The protocol that goes PING!
www.coolsigs.com