[Pacemaker] How to prevent a node that joins the cluster after reboot from starting the resources.
mark - pacemaker list
m+pacemaker at nerdish.us
Thu Aug 25 19:14:23 CET 2011
Hello,
On Mon, Aug 22, 2011 at 2:55 AM, ihjaz Mohamed <ihjazmohamed at yahoo.co.in> wrote:
> Hi,
>
> Has anyone here come across this issue?
>
>
Sorry for the delay, but I wanted to respond and let you know that I'm also
having this issue. I can pretty reliably break a fairly simple cluster setup
by rebooting one of the nodes. When the rebooted node comes back up and
starts Pacemaker, it immediately tries to start all services on itself,
ignoring that they're running happily and healthily on the other node and
that resource stickiness is configured at 1000. The result is that none of
the resources end up running anywhere: they become unmanaged, and crm status
shows them as running on the freshly rebooted node. If Pacemaker could be
configured to wait a moment at startup before trying to run services, I
think even 5 seconds would be enough for it to realize that it should not
start anything at all. I haven't been able to find a setting that
accomplishes that, though.
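The only stopgap I've come up with so far (just a sketch; 'kvm2' stands in for
whichever node is about to be rebooted, and this assumes the standard crm
shell node commands) is to put the node in standby before rebooting it, since
the standby attribute persists across restarts, and only bring it back online
once it has rejoined and can see where everything is running:

crm node standby kvm2     # run before the reboot; kvm2 will not run resources
                          # ... reboot kvm2, let cman and pacemaker start ...
crm status                # confirm resources are still on the healthy node
crm node online kvm2      # re-enable resource placement on kvm2

That obviously doesn't fix the underlying behaviour, but it would at least
keep a routine reboot from taking the whole cluster down.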
The cluster is a pretty simple one, set up to test the VirtualDomain RA,
which in and of itself has given me fits (empty state files: why do they get
emptied rather than removed, which keeps the VM from starting, no matter how
many 'resource cleanup' attempts you make, until you manually re-populate the
state file?), but that is for another troubleshooting session. This problem
is my biggie, because a healthy surviving node has all of its resources
forced off and killed by a freshly rebooted one.
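(For what it's worth, the only recovery I've found for the emptied state file
is to put the domain name back into it by hand before cleaning up; a rough
sketch only, and the path is an assumption since it depends on where your
build puts HA_RSCTMP:

echo apache1 > /var/run/resource-agents/VirtualDomain-apache1.state   # path may differ
crm resource cleanup apache1

Like I said, though, that one is for another troubleshooting session.)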
Has anybody else been running into this, or are we just two unlucky fellas?
This is currently on CentOS 6.0 with all updates (I had the same issue on
Scientific Linux 6.1, so I rolled back to CentOS for consistency, since all
the other machines here are on it). Both 'cman' and 'pacemaker' are
configured to start at boot. I'll put cluster.conf and the 'crm configure
show' output at the end of this in case it helps someone spot a glaring
mistake on my part (which I'd love it to be at this point, as that would be
easily fixed).
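One mitigation I'm tempted to try (sketch only; these are the stock CentOS 6
init scripts) is to leave cman enabled at boot but stop pacemaker from
autostarting, and only start it by hand once I've confirmed the surviving
node still holds everything:

chkconfig pacemaker off       # cman stays enabled at boot
# after the reboot, from the surviving node:
crm_mon -1                    # verify resources are still where they should be
# then, on the rebooted node:
service pacemaker start

It's manual, but it would at least stand in for the startup delay I couldn't
find a setting for.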
Regards,
Mark
<?xml version="1.0"?>
<cluster config_version="1" name="KVMCluster">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="kvm1" nodeid="1" votes="1">
      <fence/>
    </clusternode>
    <clusternode name="kvm2" nodeid="2" votes="1">
      <fence/>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices/>
  <rm/>
</cluster>
node kvm1
node kvm2
primitive apache1 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/apache1.xml" \
    meta allow-migrate="true" is-managed="true" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="90" \
    op migrate_to interval="0" timeout="120" \
    op migrate_from interval="0" timeout="60"
primitive fw1 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/fw1.xml" \
    meta allow-migrate="true" is-managed="true" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="90" \
    op migrate_to interval="0" timeout="120" \
    op migrate_from interval="0" timeout="60"
primitive vgClusterDisk ocf:heartbeat:LVM \
    params volgrpname="vgClusterDisk" \
    op start interval="0" timeout="30" \
    op stop interval="0" timeout="120"
clone shared_volgrp vgClusterDisk \
    meta target-role="Started" is-managed="true"
order storage_then_VMs inf: shared_volgrp ( fw1 apache1 )
property $id="cib-bootstrap-options" \
    dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
    cluster-infrastructure="cman" \
    no-quorum-policy="ignore" \
    stonith-action="reboot" \
    stonith-timeout="30s" \
    maintenance-mode="false" \
    pe-error-series-max="5000" \
    pe-warn-series-max="5000" \
    pe-input-series-max="5000" \
    dc-deadtime="2min" \
    stonith-enabled="false" \
    last-lrm-refresh="1314294568"
rsc_defaults $id="rsc-options" \
    resource-stickiness="1000"
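If it helps anyone compare notes, I can also dump the allocation scores before
and after the reboot; a sketch, assuming the scoring tools shipped with
pacemaker 1.1:

crm_simulate -L -s            # show allocation scores against the live CIB
# (ptest -L -s on older builds)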
> --- On Wed, 17/8/11, ihjaz Mohamed <ihjazmohamed at yahoo.co.in> wrote:
>
>
> From: ihjaz Mohamed <ihjazmohamed at yahoo.co.in>
> Subject: [Pacemaker] How to prevent a node that joins the cluster after
> reboot from starting the resources.
> To: pacemaker at oss.clusterlabs.org
> Date: Wednesday, 17 August, 2011, 12:23 PM
>
> Hi All,
>
> I'm getting an "unmanaged" error, as shown below, when one of the nodes is
> rebooted and comes back to join the cluster.
>
> Online: [ aceblr101.com aceblr107.com ]
>
> Resource Group: HAService
>     FloatingIP (ocf::heartbeat:IPaddr2): Started aceblr107.com (unmanaged) FAILED
>     acestatus (lsb:acestatus): Stopped
> Clone Set: pingdclone
>     Started: [ aceblr101.com aceblr107.com ]
>
> Failed actions:
> FloatingIP_stop_0 (node=aceblr107.com, call=7, rc=1, status=complete):
> unknown error
> Below is my configuration:
>
> node $id="8bf8e613-f63c-43a6-8915-4b2dbf72a4a5" aceblr101.com
> node $id="bde62a1f-0f29-4357-a988-0e26bb06c4fb" aceblr107.com
> primitive FloatingIP ocf:heartbeat:IPaddr2 \
> params ip="xx.xxx.xxx.xxx" nic="eth0:0"
> primitive acestatus lsb:acestatus \
> op start interval="30"
> primitive pingd ocf:pacemaker:pingd \
> params host_list="xx.xxx.xxx.1" multiplier="100" \
> op monitor interval="15s" timeout="5s"
> group HAService FloatingIP acestatus \
> meta target-role="Started"
> clone pingdclone pingd \
> meta globally-unique="false"
> location ip1_location FloatingIP \
> rule $id="ip1_location-rule" pingd: defined pingd
> property $id="cib-bootstrap-options" \
> dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
> cluster-infrastructure="Heartbeat" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1305736421"
>
> I see from the logs that when the rebooted node comes back and joins the
> cluster, the resources on that node get started even though they are
> already running on the existing node.
>
> When the resources are started on both nodes, it tries to stop them on one
> of the nodes, which fails, and the resource goes into unmanaged mode.
>
> Could anyone help me with how to configure the cluster so that the
> resources are not started on a node that rejoins the cluster after a
> reboot, when they are already running on the existing node?
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>