[Pacemaker] Cluster goes to (unmanaged) Failed state when both nodes are rebooted together

Mon Oct 24 15:08:15 UTC 2011

Hi,

On Mon, Oct 24, 2011 at 9:52 AM, Alan Robertson <alanr at unix.sh> wrote:

> Setting no-quorum-policy to ignore and disabling stonith is not a good
> idea.  You're sort of inviting the cluster to do screwed up things.
>
Isn't "no-quorum-policy ignore" more or less required for a two-node cluster?
Without it, all services stop when one of your nodes goes offline, which is
definitely not what you want. You can use "freeze" instead, but then the
resources from the downed node don't get started on the surviving one.
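
For reference, this is roughly how the two settings are applied with the crm
shell. It is only a sketch: just one of the two would be in effect at a time,
and neither is a substitute for working stonith.

```
# Keep managing resources even without quorum (the usual two-node choice)
crm configure property no-quorum-policy=ignore

# Alternative: resources already running in this partition keep running,
# but resources from the missing node are not recovered here
crm configure property no-quorum-policy=freeze
```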

The problem he's running into sounds like one I posted a question about a
while back: a node returning to the cluster doesn't wait to see whether
services are running elsewhere. Instead, it tries to start all services on
itself the second corosync launches, even though they're already started,
which leads to what his output shows: services in a started/unmanaged state.
I hit this on CentOS 6 and Scientific Linux 6.1 using pretty much stock
corosync.conf files (just adjusted for network addresses). I rebuilt the
nodes with Debian for other reasons (Xen support and familiarity) and, as a
nice side effect, the problem disappeared.
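
If it helps anyone stuck in the same state: the failed stop records can be
cleared so the cluster re-probes the resource. This is a sketch using the
resource name from the quoted configuration below. It only clears the failure
history and does not fix whatever made the stop fail in the first place.

```
# Show the current cluster state once, including failed actions
crm_mon -1

# Clear the failed-action history for FloatingIP so it gets re-probed
crm resource cleanup FloatingIP
```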

Mark

>
> On 10/24/2011 08:23 AM, ihjaz Mohamed wrote:
>
>  Hi All,
>
>  I have Pacemaker running with corosync. The following is my CRM configuration.
>
> node soalaba56
> node soalaba63
> primitive FloatingIP ocf:heartbeat:IPaddr2 \
>         params ip="<floating_ip>" nic="eth0:0"
> primitive acestatus lsb:acestatus
> primitive pingd ocf:pacemaker:ping \
>         params host_list="<gateway_ip>" multiplier="100" \
>         op monitor interval="15s" timeout="5s"
> group HAService FloatingIP acestatus \
>         meta target-role="Started"
> clone pingdclone pingd \
>         meta globally-unique="false"
> location ip1_location FloatingIP \
>         rule $id="ip1_location-rule" pingd: defined pingd
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1305736421"
> ----------------------------------------------------------------------
>
>  When I reboot both nodes together, the cluster goes into an (unmanaged)
> Failed state, as shown below.
>
>
>  ============
> Last updated: Mon Oct 24 08:10:42 2011
> Stack: openais
> Current DC: soalaba63 - partition with quorum
> Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ soalaba56 soalaba63 ]
>
>  Resource Group: HAService
>      FloatingIP (ocf::heartbeat:IPaddr2) Started  (unmanaged) FAILED[ soalaba63 soalaba56 ]
>      acestatus  (lsb:acestatus):        Stopped
>  Clone Set: pingdclone [pingd]
>      Started: [ soalaba56 soalaba63 ]
>
> Failed actions:
>     FloatingIP_stop_0 (node=soalaba63, call=7, rc=1, status=complete): unknown error
>     FloatingIP_stop_0 (node=soalaba56, call=7, rc=1, status=complete): unknown error
>
> ------------------------------------------------------------------------------
>
>  This happens only when both nodes are rebooted simultaneously. If the
> reboots are staggered, the problem does not appear. Looking at the logs, I
> see that when the nodes come up, resources are started on both nodes; the
> cluster then tries to stop the started resources, and the stop fails.
>
>  I've attached the logs.
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>
>
> --
>     Alan Robertson <alanr at unix.sh>
>
> "Openness is the foundation and preservative of friendship...  Let me claim from you at all times your undisguised opinions." - William Wilberforce
>