[Pacemaker] Cluster goes to (unmanaged) Failed state when both nodes are rebooted together

ihjaz Mohamed ihjazmohamed at yahoo.co.in
Tue Oct 25 07:17:06 UTC 2011


If I start corosync manually on both servers at the same time, the cluster comes up fine. So I'm just wondering how that differs from corosync being started by the init system during boot.



________________________________
From: Andreas Kurz <andreas at hastexo.com>
To: pacemaker at oss.clusterlabs.org
Sent: Monday, 24 October 2011 9:30 PM
Subject: Re: [Pacemaker] Cluster goes to (unmanaged) Failed state when both nodes are rebooted together

hello,

On 10/24/2011 05:21 PM, ihjaz Mohamed wrote:
> It's part of the requirement given to me to support this solution on
> servers without stonith devices, so I cannot enable stonith.

Too bad, then you have to live with some limitations of this setup. You
could add a random wait before the corosync start ... or simply: don't
reboot them at the same time ;-)
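
As a rough, untested sketch of such a stagger on a RHEL6-style init layout
(the exact paths, the 0-29 second range and the use of rc.local are only
placeholders, adjust them to your distro): keep init from starting corosync
directly and start it yourself after a random delay, e.g.

    # keep the init system from starting corosync at boot
    chkconfig corosync off
    # start it from rc.local instead, after a random 0-29 s delay,
    # so both nodes do not form the membership at exactly the same moment
    cat >> /etc/rc.d/rc.local <<'EOF'
    sleep $(( RANDOM % 30 ))
    /etc/init.d/corosync start
    EOF

Anything that keeps the two nodes from starting corosync at exactly the
same instant should do.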

But it would also be interesting to know why FloatingIP_stop_0 returns an
error on both nodes ... the logs should tell you what happened.

.... and remove nic="eth0:0"; you must not define an alias here, only the
nic itself.
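
For example, mirroring the existing primitive with only the nic changed:

    primitive FloatingIP ocf:heartbeat:IPaddr2 \
            params ip="<floating_ip>" nic="eth0"

IPaddr2 adds and removes the address itself, so no pre-configured alias
interface is needed.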

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> ------------------------------------------------------------------------
> *From:* Alan Robertson <alanr at unix.sh>
> *To:* ihjaz Mohamed <ihjazmohamed at yahoo.co.in>; The Pacemaker cluster
> resource manager <pacemaker at oss.clusterlabs.org>
> *Sent:* Monday, 24 October 2011 8:22 PM
> *Subject:* Re: [Pacemaker] Cluster goes to (unmanaged) Failed state when
> both nodes are rebooted together
> 
> Setting no-quorum-policy to ignore and disabling stonith is not a good
> idea.  You're sort of inviting the cluster to do screwed up things.
> 
> 
> On 10/24/2011 08:23 AM, ihjaz Mohamed wrote:
>> Hi All,
>>
>> I have pacemaker running with corosync. The following is my CRM configuration.
>>
>> node soalaba56
>> node soalaba63
>> primitive FloatingIP ocf:heartbeat:IPaddr2 \
>>         params ip="<floating_ip>" nic="eth0:0"
>> primitive acestatus lsb:acestatus
>> primitive pingd ocf:pacemaker:ping \
>>         params host_list="<gateway_ip>" multiplier="100" \
>>         op monitor interval="15s" timeout="5s"
>> group HAService FloatingIP acestatus \
>>         meta target-role="Started"
>> clone pingdclone pingd \
>>         meta globally-unique="false"
>> location ip1_location FloatingIP \
>>         rule $id="ip1_location-rule" pingd: defined pingd
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>         cluster-infrastructure="openais" \
>>         expected-quorum-votes="2" \
>>         stonith-enabled="false" \
>>         no-quorum-policy="ignore" \
>>         last-lrm-refresh="1305736421"
>> ----------------------------------------------------------------------
>>
>> When I reboot both nodes together, the cluster goes into an
>> (unmanaged) Failed state, as shown below.
>>
>>
>> ============
>> Last updated: Mon Oct 24 08:10:42 2011
>> Stack: openais
>> Current DC: soalaba63 - partition with quorum
>> Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>> 2 Nodes configured, 2 expected votes
>> 2 Resources configured.
>> ============
>>
>> Online: [ soalaba56 soalaba63 ]
>>
>>  Resource Group: HAService
>>      FloatingIP (ocf::heartbeat:IPaddr2) Started (unmanaged) FAILED [ soalaba63 soalaba56 ]
>>      acestatus  (lsb:acestatus):        Stopped
>>  Clone Set: pingdclone [pingd]
>>      Started: [ soalaba56 soalaba63 ]
>>
>> Failed actions:
>>     FloatingIP_stop_0 (node=soalaba63, call=7, rc=1, status=complete):
>> unknown error
>>     FloatingIP_stop_0 (node=soalaba56, call=7, rc=1, status=complete):
>> unknown error
>> ------------------------------------------------------------------------------
>>
>> This happens only when both nodes are rebooted simultaneously; if the
>> reboots are spaced out, the problem does not occur. Looking into the
>> logs, I see that when the nodes come up, the resources are started on
>> both nodes; the cluster then tries to stop the started resources and
>> fails there.
>>
>> I've attached the logs.
>>
>>
>>
> 
> 
> -- 
>     Alan Robertson <alanr at unix.sh>
> 
> "Openness is the foundation and preservative of friendship...  Let me claim from you at all times your undisguised opinions." - William Wilberforce
> 
> 
> 
> 
> 



_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker