[Pacemaker] Cluster goes to (unmanaged) Failed state when both nodes are rebooted together

Mon Oct 24 14:52:37 UTC 2011

Setting no-quorum-policy to "ignore" and disabling STONITH at the same
time is not a good idea.  You're more or less inviting the cluster to do
screwed-up things.
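
As a rough sketch of what I mean, assuming the crm shell used in the
configuration quoted below and nodes with IPMI-capable management boards:
something along these lines would put fencing back in place. The stonith
resource names, constraint names, and every value in <...> are
placeholders I made up, and which fencing agent is actually available
depends on your distribution, so treat this as an illustration rather
than a tested configuration.

```
# Hypothetical fencing resources, one per node; all <...> values are placeholders.
primitive st-soalaba56 stonith:external/ipmi \
        params hostname="soalaba56" ipaddr="<ipmi_ip_of_soalaba56>" \
               userid="<ipmi_user>" passwd="<ipmi_password>" interface="lan" \
        op monitor interval="60s"
primitive st-soalaba63 stonith:external/ipmi \
        params hostname="soalaba63" ipaddr="<ipmi_ip_of_soalaba63>" \
               userid="<ipmi_user>" passwd="<ipmi_password>" interface="lan" \
        op monitor interval="60s"
# Keep each fencing resource off the node it is meant to fence.
location l-st-soalaba56 st-soalaba56 -inf: soalaba56
location l-st-soalaba63 st-soalaba63 -inf: soalaba63
# Only turn fencing on once the devices above are known to work.
property stonith-enabled="true"
```

With fencing available, a node whose stop operation fails can be fenced
and the resource recovered elsewhere, instead of being left unmanaged.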

On 10/24/2011 08:23 AM, ihjaz Mohamed wrote:
> Hi All,
>
> I have Pacemaker running with Corosync. The following is my CRM configuration.
>
> node soalaba56
> node soalaba63
> primitive FloatingIP ocf:heartbeat:IPaddr2 \
>         params ip="<floating_ip>" nic="eth0:0"
> primitive acestatus lsb:acestatus
> primitive pingd ocf:pacemaker:ping \
>         params host_list="<gateway_ip>" multiplier="100" \
>         op monitor interval="15s" timeout="5s"
> group HAService FloatingIP acestatus \
>         meta target-role="Started"
> clone pingdclone pingd \
>         meta globally-unique="false"
> location ip1_location FloatingIP \
>         rule $id="ip1_location-rule" pingd: defined pingd
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1305736421"
> ----------------------------------------------------------------------
>
> When I reboot both nodes together, the cluster goes into an
> (unmanaged) FAILED state, as shown below.
>
>
> ============
> Last updated: Mon Oct 24 08:10:42 2011
> Stack: openais
> Current DC: soalaba63 - partition with quorum
> Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ soalaba56 soalaba63 ]
>
>  Resource Group: HAService
>      FloatingIP (ocf::heartbeat:IPaddr2) Started  (unmanaged) FAILED [ soalaba63 soalaba56 ]
>      acestatus  (lsb:acestatus):        Stopped
>  Clone Set: pingdclone [pingd]
>      Started: [ soalaba56 soalaba63 ]
>
> Failed actions:
>     FloatingIP_stop_0 (node=soalaba63, call=7, rc=1, status=complete): unknown error
>     FloatingIP_stop_0 (node=soalaba56, call=7, rc=1, status=complete): unknown error
> ------------------------------------------------------------------------------
>
> This happens only when both nodes are rebooted simultaneously. If the
> reboots are done with some interval in between, the problem does not
> occur. Looking at the logs, I see that when the nodes come up, the
> resources are started on both nodes; the cluster then tries to stop
> the started resources, and the stop fails.
>
> I've attached the logs.
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

-- 
     Alan Robertson<alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim from you at all times your undisguised opinions." - William Wilberforce
