[Pacemaker] Cluster Refuses to Stop/Shutdown
Andrew Beekhof
andrew at beekhof.net
Tue Oct 6 14:20:58 UTC 2009
I could re-paste the whole thing, but it's easier to just throw up the link:
http://theclusterguy.clusterlabs.org/post/205886990/advisory-dont-use-pacemaker-on-corosync-yet
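
(Short version, as the link's title suggests: the pacemaker plugin had
known shutdown problems when running on corosync at the time, and the
advice was to keep running it on openais/whitetank until those were
fixed. A minimal sketch of the interim workaround, assuming the init
scripts shipped with the packages:

    /etc/init.d/corosync stop     # stop the corosync-based stack
    /etc/init.d/openais start     # run the pacemaker plugin under openais instead

)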
On Thu, Sep 24, 2009 at 4:56 PM, Remi Broemeling <remi at nexopia.com> wrote:
> I posted this to the OpenAIS mailing list (
> openais at lists.linux-foundation.org) yesterday, but haven't received a
> response, and on further reflection I think I chose the wrong list to
> post to: that list seems to be far less about user support and far more
> about developer communication. Therefore I'm re-trying here, as the
> archives show this list to be somewhat more user-focused.
>
> The problem is that corosync refuses to shut down in response to a QUIT
> signal. Given the cluster below (output of crm_mon):
>
> ============
> Last updated: Wed Sep 23 15:56:24 2009
> Stack: openais
> Current DC: boot1 - partition with quorum
> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ boot1 boot2 ]
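>
> (Each of these snapshots is a one-shot crm_mon run on the node in
> question; assuming the stock tool, that is something like
>
>     crm_mon -1
>
> on each node.)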
>
> If I go onto the host 'boot2' and issue the command "killall -QUIT
> corosync", the anticipated result would be that boot2 goes offline (out
> of the cluster) and all of the cluster processes
> (corosync/stonithd/cib/lrmd/attrd/pengine/crmd) shut down. However,
> this is not what occurs, and I don't really have any idea why. After
> logging into boot2 and issuing "killall -QUIT corosync", the result is
> a split-brain:
>
> From boot1's viewpoint:
> ============
> Last updated: Wed Sep 23 15:58:27 2009
> Stack: openais
> Current DC: boot1 - partition WITHOUT quorum
> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ boot1 ]
> OFFLINE: [ boot2 ]
>
> From boot2's viewpoint:
> ============
> Last updated: Wed Sep 23 15:58:35 2009
> Stack: openais
> Current DC: boot1 - partition with quorum
> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ boot1 boot2 ]
>
> At this point the status quo holds until ANOTHER QUIT signal is sent to
> corosync (i.e. the command "killall -QUIT corosync" is executed on boot2
> again). Then boot2 shuts down properly and everything appears to be
> kosher. In short, what I expect to happen after a single QUIT signal is
> instead taking two, which summarizes my question: why does it take two
> QUIT signals to force corosync to actually shut down? Is that the
> desired behavior? Everything I have read online suggests it is very
> strange, and it makes me think that I have a problem in my
> configuration(s), but I've no idea what that would be, even after
> playing with things and investigating for the day.
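>
> (To make the half-shut-down state concrete, one can check which of the
> cluster daemons are still alive on boot2 after the first QUIT. Just a
> sketch, assuming the standard process names:
>
>     ps -e | egrep 'corosync|stonithd|cib|lrmd|attrd|pengine|crmd'
> )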
>
> I would be very grateful for any guidance that could be provided, as at the
> moment I seem to be at an impasse.
>
> Log files, with debugging set to 'on', can be found at the following
> pastebin locations:
> After first QUIT signal issued on boot2:
> boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
> boot2:/var/log/syslog: http://pastebin.com/d26fdfee
> After second QUIT signal issued on boot2:
> boot1:/var/log/syslog: http://pastebin.com/m755fb989
> boot2:/var/log/syslog: http://pastebin.com/m22dcef45
>
> OS, Software Packages, and Versions:
> * two nodes, each running Ubuntu Hardy Heron LTS
> * ubuntu-ha packages, as downloaded from
> http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
>     * pacemaker-openais package version 1.0.5+hg20090813-0ubuntu2~hardy1
>     * openais package version 1.0.0-3ubuntu1~hardy1
>     * corosync package version 1.0.0-4ubuntu1~hardy2
>     * heartbeat-common package version 2.99.2+sles11r9-5ubuntu1~hardy1
>
> Network Setup:
> * boot1
>     * eth0 is 192.168.10.192
>     * eth1 is 172.16.1.1
> * boot2
>     * eth0 is 192.168.10.193
>     * eth1 is 172.16.1.2
> * boot1:eth0 and boot2:eth0 both connect to the same switch.
> * boot1:eth1 and boot2:eth1 are connected directly to each other via a
> cross-over cable.
> * no firewalls are involved, and tcpdump shows the multicast and UDP
> traffic flowing correctly over these links (see the sketch after this
> list).
> * I attempted a broadcast (rather than multicast) configuration, to see
> if that would fix the problem. It did not.
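>
> As an example of the tcpdump verification mentioned above (a sketch;
> the filter matches ring 0's multicast port on eth1):
>
>     tcpdump -ni eth1 'udp port 5505'
>
> and likewise 'udp port 6606' on eth0 for ring 1.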
>
> `crm configure show` output:
> node boot1
> node boot2
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore"
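>
> (For reference, the two non-default properties above can be set from
> the crm shell; a sketch, assuming the crm shell bundled with
> pacemaker 1.0:
>
>     crm configure property stonith-enabled=false
>     crm configure property no-quorum-policy=ignore
> )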
>
> Contents of /etc/corosync/corosync.conf:
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> totem {
>     clear_node_high_bit: yes
>     version: 2
>     secauth: on
>     threads: 1
>     heartbeat_failures_allowed: 3
>     interface {
>         ringnumber: 0
>         bindnetaddr: 172.16.1.0
>         mcastaddr: 239.42.0.1
>         mcastport: 5505
>     }
>     interface {
>         ringnumber: 1
>         bindnetaddr: 192.168.10.0
>         mcastaddr: 239.42.0.2
>         mcastport: 6606
>     }
>     rrp_mode: passive
> }
>
> amf {
>     mode: disabled
> }
>
> service {
>     name: pacemaker
>     ver: 0
> }
>
> aisexec {
>     user: root
>     group: root
> }
>
> logging {
>     debug: on
>     fileline: off
>     function_name: off
>     to_logfile: no
>     to_stderr: no
>     to_syslog: yes
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>     }
> }
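>
> (With two rings and rrp_mode: passive, the health of the rings
> themselves can be checked on each node — a sketch, assuming corosync's
> bundled corosync-cfgtool:
>
>     corosync-cfgtool -s
>
> which prints the status of ring 0 and ring 1.)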
> --
>
> Remi Broemeling
> Sr System Administrator
>
> Nexopia.com Inc.
> direct: 780 444 1250 ext 435
> email: remi at nexopia.com
> fax: 780 487 0376
>
> Cat toys, n.: Anything not nailed down, and some that are.
> http://www.fortlangley.ca/pepin/taglines.html
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>