[Pacemaker] standby attribute and same resources running at the same time

Andrew Beekhof andrew at beekhof.net
Tue Mar 5 23:14:18 EST 2013


On Tue, Mar 5, 2013 at 4:20 AM, Leon Fauster <leonfauster at googlemail.com> wrote:
> Dear list,
>
> Apologies in advance for the trivial questions - I have just started deploying an HA
> environment in a test lab, so I do not have much experience yet.
>
>
>
> I started to set up a 2-node cluster
>
>   corosync-1.4.1-15.el6.x86_64
>   pacemaker-1.1.8-7.el6.x86_64
>   cman-3.0.12.1-49.el6.x86_64
>
> with rhel6.3 and then switched to rhel6.4.
>
> This update brings some differences: the crm shell is gone and pcs has been added.
> Anyway, I found equivalent commands to set up and configure the resources.
>
> So far, so good. I am doing some stress testing now and noticed that when I reboot
> one node (n2), that node (n2) is marked as standby in the CIB (as shown on the
> other node (n1)).
>
> After the reboot, crm_mon on the rebooted node (n2) shows that the other node (n1)
> is offline, and n2 begins to start the resources. Meanwhile the node that wasn't
> rebooted (n1) still shows n2 as standby. At that point both nodes are running the "same"
> resources. After a couple of minutes the cluster notices this and both nodes
> renegotiate the current state; one node then takes over responsibility for providing
> the resources. On both nodes the previously rebooted node is still listed as standby.
>
>
>   cat /var/log/messages |grep error
>   Mar  4 17:32:33 cn1 pengine[1378]:    error: native_create_actions: Resource resIP (ocf::IPaddr2) is active on 2 nodes attempting recovery
>   Mar  4 17:32:33 cn1 pengine[1378]:    error: native_create_actions: Resource resApache (ocf::apache) is active on 2 nodes attempting recovery
>   Mar  4 17:32:33 cn1 pengine[1378]:    error: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-error-6.bz2
>   Mar  4 17:32:48 cn1 crmd[1379]:   notice: run_graph: Transition 1 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-6.bz2): Complete
>
>
>   crm_mon -1
>   Last updated: Mon Mar  4 17:49:08 2013
>   Last change: Mon Mar  4 10:22:53 2013 via crm_resource on cn1.localdomain
>   Stack: cman
>   Current DC: cn1.localdomain - partition with quorum
>   Version: 1.1.8-7.el6-394e906
>   2 Nodes configured, 2 expected votes
>   2 Resources configured.
>
>   Node cn2.localdomain: standby
>   Online: [ cn1.localdomain ]
>
>   resIP (ocf::heartbeat:IPaddr2):       Started cn1.localdomain
>   resApache     (ocf::heartbeat:apache):        Started cn1.localdomain
>
>
> I checked the init scripts and found that the standby "behavior" comes
> from a function that is called on "service pacemaker stop" (added in rhel6.4):
>
> cman_pre_stop()
> {
>     cname=`crm_node --name`
>     crm_attribute -N $cname -n standby -v true -l reboot
>     echo -n "Waiting for shutdown of managed resources"
> ...

That will only last until the node comes back (the cluster will remove
it automatically); the core problem is that it appears not to have.
Can you file a bug and attach a crm_report covering the period of the
restart?
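
Untested sketch, but assuming the attribute really was set with "-l reboot" as in the
init script snippet above, it is a transient attribute in the status section, which
would explain why a query or delete without a matching lifetime does not find it.
Something along these lines should show it and, if it lingers after the node has
rejoined, remove it by hand (node name taken from your output):

  # query the transient (reboot-lifetime) standby attribute on cn2
  crm_attribute -N cn2.localdomain -n standby -l reboot -G

  # delete it manually if it is still set once the node is back online
  crm_attribute -N cn2.localdomain -n standby -l reboot -D

For the crm_report, something like the following should capture the window around the
restart (the timestamps are only an example based on your log excerpt - adjust them to
the actual reboot time; /tmp/standby-restart is just an arbitrary output name):

  crm_report -f "2013-03-04 17:00:00" -t "2013-03-04 18:00:00" /tmp/standby-restart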

>
> I could not delete the standby attribute with
>
> crm_attribute -G --node=cn2.localdomain -n standby
>
>
>
> Okay - to recap:
>
> 1st: there is a delay after rebooting during which the two nodes don't see each
> other, and the result is resources running on both
> nodes when they should only run on one node. This is eventually corrected
> by the cluster itself, but the situation should not happen in the first place.
>
> 2nd: the standby attribute (and there must be a reason why Red Hat
> added it) prevents resources from migrating to that node. How
> do I delete this attribute?
>
> I appreciate any comments.
>
> --
> Leon
>
>
>
> A. $ cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
>  <cluster name="HA" config_version="5">
>    <logging debug="off"/>
>    <clusternodes>
>      <clusternode name="cn1.localdomain" votes="1" nodeid="1">
>        <fence>
>          <method name="pcmk-redirect">
>            <device name="pcmk" port="cn1.localdomain"/>
>          </method>
>        </fence>
>      </clusternode>
>      <clusternode name="cn2.localdomain" votes="1" nodeid="2">
>        <fence>
>          <method name="pcmk-redirect">
>            <device name="pcmk" port="cn2.localdomain"/>
>          </method>
>        </fence>
>      </clusternode>
>    </clusternodes>
>    <fencedevices>
>      <fencedevice name="pcmk" agent="fence_pcmk"/>
>    </fencedevices>
>    <rm>
>      <failoverdomains/>
>      <resources/>
>    </rm>
>  </cluster>
>
>
> B. $ pcs config
> Corosync Nodes:
>
> Pacemaker Nodes:
>  cn1.localdomain cn2.localdomain
>
> Resources:
>  Resource: resIP (provider=heartbeat type=IPaddr2 class=ocf)
>   Attributes: ip=192.168.201.220 nic=eth0 cidr_netmask=24
>   Operations: monitor interval=30s
>  Resource: resApache (provider=heartbeat type=apache class=ocf)
>   Attributes: httpd=/usr/sbin/httpd configfile=/etc/httpd/conf/httpd.conf
>   Operations: monitor interval=1min
>
> Location Constraints:
> Ordering Constraints:
>   start resApache then start resIP
> Colocation Constraints:
>   resIP with resApache
>
> Cluster Properties:
>  dc-version: 1.1.8-7.el6-394e906
>  cluster-infrastructure: cman
>  expected-quorum-votes: 2
>  stonith-enabled: false
>  no-quorum-policy: ignore
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
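
One more note on point 1 of your recap: the posted Cluster Properties show
stonith-enabled: false and no-quorum-policy: ignore (the latter is normal for a
two-node cluster, but it makes working fencing essential). With fencing disabled,
nothing stops each side from starting the resources on its own while the nodes cannot
see each other, so a window with the "same" resources active on both nodes is expected
with this configuration. A rough sketch of turning fencing back on, assuming the pcs
version shipped with rhel6.4; a real fence device still has to be configured first,
otherwise the cluster will simply block:

  # once a working fence device is configured, re-enable fencing
  pcs property set stonith-enabled=true

  # equivalent via crm_attribute
  crm_attribute -t crm_config -n stonith-enabled -v true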



