[Pacemaker] Cluster Refuses to Stop/Shutdown
Remi Broemeling
remi at nexopia.com
Thu Sep 24 23:27:54 UTC 2009
Ok, thanks for the note, Steven. I've filed the bug; it is #525589.
Steven Dake wrote:
> Remi,
>
> Likely a defect. We will have to look into it. Please file a bug as
> per the instructions on the corosync wiki at www.corosync.org.
>
> On Thu, 2009-09-24 at 16:47 -0600, Remi Broemeling wrote:
>
>> I've spent all day working on this, even going so far as to completely
>> build my own set of packages from the Debian-available ones (which
>> appear to be different from the Ubuntu-available ones). It didn't
>> have any effect on the issue at all: the cluster still freaks out and
>> ends up in a split-brain state after a single SIGQUIT.
>>
>> The Debian packages that also demonstrate this behavior were the
>> following versions:
>> cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
>> corosync_1.0.0-5~bpo50+1_i386.deb
>> libcorosync4_1.0.0-5~bpo50+1_i386.deb
>> libopenais3_1.0.0-4~bpo50+1_i386.deb
>> openais_1.0.0-4~bpo50+1_i386.deb
>> pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb
>>
>> These packages were re-built (under Ubuntu Hardy Heron LTS) from the
>> *.diff.gz, *.dsc, and *.orig.tar.gz files available at
>> http://people.debian.org/~madkiss/ha-corosync, and as I said, the
>> symptoms remain exactly the same, both under the configuration that I
>> list below and under the sample configuration that came with these
>> packages. I also attempted the same with a single IP address resource
>> associated with the cluster, just to be sure it wasn't an edge case
>> for a cluster with no resources, but again that had no effect.
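>>
>> For reference, a rebuild of this kind can be done with the standard
>> Debian source tools, e.g. for the corosync package (a sketch; the
>> exact invocations I used may have differed slightly):
>>
>>   # with the .dsc, .diff.gz, and .orig.tar.gz in the current directory
>>   dpkg-source -x corosync_1.0.0-5~bpo50+1.dsc
>>   cd corosync-1.0.0
>>   # build unsigned binary packages
>>   dpkg-buildpackage -rfakeroot -us -uc -b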
>>
>> Basically, I'm still exactly where I was yesterday morning at about
>> 09:00.
>>
>> Remi Broemeling wrote:
>>
>>> I posted this to the OpenAIS mailing list
>>> (openais at lists.linux-foundation.org) yesterday, but I haven't
>>> received a response, and upon further reflection I think I chose the
>>> wrong list to post to. That list seems to be far less about user
>>> support and far more about developer communication, so I'm re-trying
>>> here, as the archives show this list to be somewhat more
>>> user-focused.
>>>
>>> The problem is that corosync refuses to shut down in response to a
>>> QUIT signal. Given the following cluster (output of crm_mon):
>>>
>>> ============
>>> Last updated: Wed Sep 23 15:56:24 2009
>>> Stack: openais
>>> Current DC: boot1 - partition with quorum
>>> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
>>> 2 Nodes configured, 2 expected votes
>>> 0 Resources configured.
>>> ============
>>>
>>> Online: [ boot1 boot2 ]
>>>
>>> If I go onto the host 'boot2' and issue the command "killall -QUIT
>>> corosync", the anticipated result would be that boot2 would go
>>> offline (out of the cluster) and all of the cluster processes
>>> (corosync/stonithd/cib/lrmd/attrd/pengine/crmd) would shut down.
>>> However, this is not occurring, and I don't really have any idea
>>> why. After logging into boot2 and issuing the command "killall
>>> -QUIT corosync", the result is a split-brain:
>>>
>>> From boot1's viewpoint:
>>> ============
>>> Last updated: Wed Sep 23 15:58:27 2009
>>> Stack: openais
>>> Current DC: boot1 - partition WITHOUT quorum
>>> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
>>> 2 Nodes configured, 2 expected votes
>>> 0 Resources configured.
>>> ============
>>>
>>> Online: [ boot1 ]
>>> OFFLINE: [ boot2 ]
>>>
>>> From boot2's viewpoint:
>>> ============
>>> Last updated: Wed Sep 23 15:58:35 2009
>>> Stack: openais
>>> Current DC: boot1 - partition with quorum
>>> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
>>> 2 Nodes configured, 2 expected votes
>>> 0 Resources configured.
>>> ============
>>>
>>> Online: [ boot1 boot2 ]
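>>>
>>> As a quick check of which cluster daemons are still alive on boot2
>>> after the first signal, something like this can be used (a sketch,
>>> using the daemon names listed above):
>>>
>>>   # on boot2, after the first QUIT signal has been sent
>>>   ps -e | egrep 'corosync|stonithd|cib|lrmd|attrd|pengine|crmd'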
>>>
>>> At this point the status quo holds until ANOTHER QUIT signal is
>>> sent to corosync (i.e., the command "killall -QUIT corosync" is
>>> executed on boot2 again). Then boot2 shuts down properly and
>>> everything appears to be kosher. Basically, what I expect to happen
>>> after a single QUIT signal instead takes two QUIT signals to occur,
>>> and that summarizes my question: why does it take two QUIT signals
>>> to force corosync to actually shut down? Is that the desired
>>> behavior? From everything I have read online it seems very strange,
>>> and it makes me think that I have a problem in my configuration(s),
>>> but I've no idea what that would be, even after playing with things
>>> and investigating for the day.
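>>>
>>> To put the reproduction into commands (a sketch, matching the
>>> description above):
>>>
>>>   # on boot2: first QUIT signal -- corosync stays up, cluster splits
>>>   killall -QUIT corosync
>>>   crm_mon -1    # boot2 still reports: Online: [ boot1 boot2 ]
>>>
>>>   # on boot2: second QUIT signal -- only now does corosync shut down
>>>   killall -QUIT corosync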
>>>
>>> I would be very grateful for any guidance that could be provided, as
>>> at the moment I seem to be at an impasse.
>>>
>>> Log files, with debugging set to 'on', can be found at the following
>>> pastebin locations:
>>> After the first QUIT signal issued on boot2:
>>>   boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
>>>   boot2:/var/log/syslog: http://pastebin.com/d26fdfee
>>> After the second QUIT signal issued on boot2:
>>>   boot1:/var/log/syslog: http://pastebin.com/m755fb989
>>>   boot2:/var/log/syslog: http://pastebin.com/m22dcef45
>>>
>>> OS, Software Packages, and Versions:
>>> * two nodes, each running Ubuntu Hardy Heron LTS
>>> * ubuntu-ha packages, as downloaded from
>>>   http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
>>>   * pacemaker-openais package version 1.0.5+hg20090813-0ubuntu2~hardy1
>>>   * openais package version 1.0.0-3ubuntu1~hardy1
>>>   * corosync package version 1.0.0-4ubuntu1~hardy2
>>>   * heartbeat-common package version 2.99.2+sles11r9-5ubuntu1~hardy1
>>>
>>> Network Setup:
>>> * boot1
>>>   * eth0 is 192.168.10.192
>>>   * eth1 is 172.16.1.1
>>> * boot2
>>>   * eth0 is 192.168.10.193
>>>   * eth1 is 172.16.1.2
>>> * boot1:eth0 and boot2:eth0 both connect to the same switch.
>>> * boot1:eth1 and boot2:eth1 are connected directly to each other
>>>   via a cross-over cable.
>>> * no firewalls are involved, and tcpdump shows the multicast and
>>>   UDP traffic flowing correctly over these links (see the tcpdump
>>>   sketch after this list).
>>> * I attempted a broadcast (rather than multicast) configuration,
>>>   to see if that would fix the problem. It did not.
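>>>
>>> The tcpdump check was along these lines (a sketch; the interfaces
>>> and ports correspond to the corosync.conf below):
>>>
>>>   # ring 0: multicast on the cross-over link (eth1)
>>>   tcpdump -ni eth1 udp port 5505
>>>   # ring 1: multicast on the switched link (eth0)
>>>   tcpdump -ni eth0 udp port 6606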
>>>
>>> `crm configure show` output:
>>> node boot1
>>> node boot2
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
>>>         cluster-infrastructure="openais" \
>>>         expected-quorum-votes="2" \
>>>         stonith-enabled="false" \
>>>         no-quorum-policy="ignore"
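>>>
>>> (For reference, properties like these can be set from the crm
>>> shell, e.g.:)
>>>
>>>   crm configure property stonith-enabled=false
>>>   crm configure property no-quorum-policy=ignore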
>>>
>>> Contents of /etc/corosync/corosync.conf:
>>> # Please read the corosync.conf.5 manual page
>>> compatibility: whitetank
>>>
>>> totem {
>>>         clear_node_high_bit: yes
>>>         version: 2
>>>         secauth: on
>>>         threads: 1
>>>         heartbeat_failures_allowed: 3
>>>         interface {
>>>                 ringnumber: 0
>>>                 bindnetaddr: 172.16.1.0
>>>                 mcastaddr: 239.42.0.1
>>>                 mcastport: 5505
>>>         }
>>>         interface {
>>>                 ringnumber: 1
>>>                 bindnetaddr: 192.168.10.0
>>>                 mcastaddr: 239.42.0.2
>>>                 mcastport: 6606
>>>         }
>>>         rrp_mode: passive
>>> }
>>>
>>> amf {
>>>         mode: disabled
>>> }
>>>
>>> service {
>>>         name: pacemaker
>>>         ver: 0
>>> }
>>>
>>> aisexec {
>>>         user: root
>>>         group: root
>>> }
>>>
>>> logging {
>>>         debug: on
>>>         fileline: off
>>>         function_name: off
>>>         to_logfile: no
>>>         to_stderr: no
>>>         to_syslog: yes
>>>         timestamp: on
>>>         logger_subsys {
>>>                 subsys: AMF
>>>                 debug: off
>>>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>>>         }
>>> }
>>>
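>>> For completeness, the health of both redundant rings can be checked
>>> on each node with corosync's bundled tool (a sketch; corosync-cfgtool
>>> ships with the corosync package):
>>>
>>>   # print the status of ring 0 and ring 1 on the local node
>>>   corosync-cfgtool -s
>>>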
--
Remi Broemeling
Sr System Administrator
Nexopia.com Inc.
direct: 780 444 1250 ext 435
email: remi at nexopia.com
fax: 780 487 0376
www.nexopia.com
ICMP: The protocol that goes PING!
www.coolsigs.com