[Pacemaker] Cluster Refuses to Stop/Shutdown
Steven Dake
sdake at redhat.com
Thu Sep 24 22:47:54 UTC 2009
Remi,
Likely a defect. We will have to look into it. Please file a bug per
the instructions on the corosync wiki at www.corosync.org.
On Thu, 2009-09-24 at 16:47 -0600, Remi Broemeling wrote:
> I've spent all day working on this, even going so far as to completely
> build my own set of packages from the Debian-available ones (which
> appear to be different from the Ubuntu-available ones). It didn't
> have any effect on the issue at all: the cluster still freaks out and
> becomes split-brain after a single SIGQUIT.
>
> The Debian packages that also demonstrate this behavior were the
> following versions:
> cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
> corosync_1.0.0-5~bpo50+1_i386.deb
> libcorosync4_1.0.0-5~bpo50+1_i386.deb
> libopenais3_1.0.0-4~bpo50+1_i386.deb
> openais_1.0.0-4~bpo50+1_i386.deb
> pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb
>
> These packages were re-built (under Ubuntu Hardy Heron LTS) from the
> *.diff.gz, *.dsc, and *.orig.tar.gz files available at
> http://people.debian.org/~madkiss/ha-corosync, and, as I said, the
> symptoms remain exactly the same, both under the configuration that I
> list below and the sample configuration that came with these packages.
> I also attempted the same with a single IP address resource associated
> with the cluster, just to be sure it wasn't an edge case for a cluster
> with no resources, but again that had no effect.
>
> Basically I'm still exactly at the point that I was at yesterday
> morning at about 0900.
>
> Remi Broemeling wrote:
> > I posted this to the OpenAIS Mailing List
> > (openais at lists.linux-foundation.org) yesterday, but haven't received
> > a response and upon further reflection I think that maybe I chose
> > the wrong list to post it to. That list seems to be far less about
> > user support and far more about developer communication. Therefore
> > I am re-trying here, as the archives show it to be somewhat more
> > user-focused.
> >
> > The problem is that corosync refuses to shut down in response to a
> > QUIT signal. Given the below cluster
> > (output of crm_mon):
> >
> > ============
> > Last updated: Wed Sep 23 15:56:24 2009
> > Stack: openais
> > Current DC: boot1 - partition with quorum
> > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > 2 Nodes configured, 2 expected votes
> > 0 Resources configured.
> > ============
> >
> > Online: [ boot1 boot2 ]
> >
> > If I go onto the host 'boot2' and issue the command "killall -QUIT
> > corosync", the anticipated result would be that boot2 would go
> > offline (out of the cluster) and all of the cluster processes
> > (corosync/stonithd/cib/lrmd/attrd/pengine/crmd) would shut down.
> > However, this is not occurring, and I don't really have any idea
> > why. After logging into boot2 and issuing the command "killall
> > -QUIT corosync", the result is a split-brain:
> >
> > From boot1's viewpoint:
> > ============
> > Last updated: Wed Sep 23 15:58:27 2009
> > Stack: openais
> > Current DC: boot1 - partition WITHOUT quorum
> > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > 2 Nodes configured, 2 expected votes
> > 0 Resources configured.
> > ============
> >
> > Online: [ boot1 ]
> > OFFLINE: [ boot2 ]
> >
> > From boot2's viewpoint:
> > ============
> > Last updated: Wed Sep 23 15:58:35 2009
> > Stack: openais
> > Current DC: boot1 - partition with quorum
> > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > 2 Nodes configured, 2 expected votes
> > 0 Resources configured.
> > ============
> >
> > Online: [ boot1 boot2 ]
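(As a quick way to see which of those daemons actually survived the
first SIGQUIT on boot2, a small pgrep loop works; this is only an
illustrative sketch — the helper name `still_running` is mine, not part
of corosync or pacemaker:)

```shell
#!/bin/sh
# Print which of the named processes are still alive, one per line.
# Illustrative helper only -- not part of corosync/pacemaker.
still_running() {
    for proc in "$@"; do
        if pgrep -x "$proc" >/dev/null 2>&1; then
            echo "$proc"
        fi
    done
}

# e.g. on boot2, after the first SIGQUIT:
# still_running corosync stonithd cib lrmd attrd pengine crmd
```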
> >
> > At this point the status quo holds until such time as ANOTHER QUIT
> > signal is sent to corosync, (i.e. the command "killall -QUIT
> > corosync" is executed on boot2 again). Then, boot2 shuts down
> > properly and everything appears to be kosher. Basically, what I
> > expect to happen after a single QUIT signal instead takes two QUIT
> > signals to occur; and that summarizes my question: why does it
> > take two QUIT signals to force corosync to actually shut down? Is
> > that the desired behavior? From everything that I have read online
> > it seems very strange, and it makes me think that I have a problem
> > in my configuration(s), but I have no idea what that would be even
> > after playing with things and investigating for the day.
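(Until the underlying cause is found, the two-signal behavior described
above can be worked around by re-sending SIGQUIT until the process is
actually gone. A minimal sketch, under the assumption that a second
signal reliably finishes the job as observed; the helper name
`signal_until_dead` and the retry limit are illustrative, not anything
shipped with corosync:)

```shell
#!/bin/sh
# Send SIGQUIT to the named process repeatedly until it exits, up to a
# maximum number of attempts. Returns 0 once the process is gone, 1 if
# it is still running after all attempts. Illustrative helper only.
signal_until_dead() {
    name="$1"
    tries="${2:-5}"
    i=0
    while pgrep -x "$name" >/dev/null 2>&1; do
        [ "$i" -ge "$tries" ] && return 1
        pkill -QUIT -x "$name"
        i=$((i + 1))
        sleep 1
    done
    return 0
}

# e.g. instead of a bare "killall -QUIT corosync" on boot2:
# signal_until_dead corosync 5
```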
> >
> > I would be very grateful for any guidance that could be provided, as
> > at the moment I seem to be at an impasse.
> >
> > Log files, with debugging set to 'on', can be found at the following
> > pastebin locations:
> > After first QUIT signal issued on boot2:
> > boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
> > boot2:/var/log/syslog: http://pastebin.com/d26fdfee
> > After second QUIT signal issued on boot2:
> > boot1:/var/log/syslog: http://pastebin.com/m755fb989
> > boot2:/var/log/syslog: http://pastebin.com/m22dcef45
> >
> > OS, Software Packages, and Versions:
> > * two nodes, each running Ubuntu Hardy Heron LTS
> > * ubuntu-ha packages, as downloaded from
> > http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
> >   * pacemaker-openais package version 1.0.5+hg20090813-0ubuntu2~hardy1
> >   * openais package version 1.0.0-3ubuntu1~hardy1
> >   * corosync package version 1.0.0-4ubuntu1~hardy2
> >   * heartbeat-common package version 2.99.2+sles11r9-5ubuntu1~hardy1
> >
> > Network Setup:
> > * boot1
> > * eth0 is 192.168.10.192
> > * eth1 is 172.16.1.1
> > * boot2
> > * eth0 is 192.168.10.193
> > * eth1 is 172.16.1.2
> > * boot1:eth0 and boot2:eth0 both connect to the same switch.
> > * boot1:eth1 and boot2:eth1 are connected directly to each other
> > via a cross-over cable.
> > * no firewalls are involved, and tcpdump shows the multicast and
> > UDP traffic flowing correctly over these links.
> > * I attempted a broadcast (rather than multicast) configuration,
> > to see if that would fix the problem. It did not.
> >
> > `crm configure show` output:
> > node boot1
> > node boot2
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
> >     cluster-infrastructure="openais" \
> >     expected-quorum-votes="2" \
> >     stonith-enabled="false" \
> >     no-quorum-policy="ignore"
> >
> > Contents of /etc/corosync/corosync.conf:
> > # Please read the corosync.conf.5 manual page
> > compatibility: whitetank
> >
> > totem {
> >     clear_node_high_bit: yes
> >     version: 2
> >     secauth: on
> >     threads: 1
> >     heartbeat_failures_allowed: 3
> >     interface {
> >         ringnumber: 0
> >         bindnetaddr: 172.16.1.0
> >         mcastaddr: 239.42.0.1
> >         mcastport: 5505
> >     }
> >     interface {
> >         ringnumber: 1
> >         bindnetaddr: 192.168.10.0
> >         mcastaddr: 239.42.0.2
> >         mcastport: 6606
> >     }
> >     rrp_mode: passive
> > }
> >
> > amf {
> >     mode: disabled
> > }
> >
> > service {
> >     name: pacemaker
> >     ver: 0
> > }
> >
> > aisexec {
> >     user: root
> >     group: root
> > }
> >
> > logging {
> >     debug: on
> >     fileline: off
> >     function_name: off
> >     to_logfile: no
> >     to_stderr: no
> >     to_syslog: yes
> >     timestamp: on
> >     logger_subsys {
> >         subsys: AMF
> >         debug: off
> >         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> >     }
> > }
>
> --
>
> Remi Broemeling
> Sr System Administrator
>
> Nexopia.com Inc.
> direct: 780 444 1250 ext 435
> email: remi at nexopia.com
> fax: 780 487 0376
>
> www.nexopia.com
>
> On going to war over religion: "You're basically killing each other to
> see who's got the better imaginary friend."
> Rich Jeni
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker