[Pacemaker] Problem in Stonith configuration

Tue Oct 18 09:21:30 UTC 2011

Hello,
> 1. If a resource fails, node should reboot (through fencing mechanism)
> and resources should re-start on the node.

> Why would you want that? This would increase the service downtime
> considerably. Why is a local restart not possible ... and even if there
> is a good reason for a reboot, why not start the resource on the
> other node?

- In our system, there are some primitive and clone resources along with 3
  different master-slave resources.
- All the masters and slaves of these resources are co-located, i.e. all 3
  masters are co-located on one node and the 3 slaves on the other node.
- These 3 master-slave resources are tightly coupled. There is a requirement
  that failure of any one of these resources restarts all the resources
  in the group.
- All these resources can be shifted to the other node, but subsequently they
  should also be restarted, as a lot of data/control plane syncing is
  done between the two nodes.
  e.g. If one of the resources running on node1 as a Master fails, then all
  3 resources are shifted to the other node, i.e. node2 (with the
  corresponding slave resources being promoted to master). On node1, these
  resources should get restarted as slaves.

We understand that a node restart will increase the downtime, but since we
could not find much on options for a group restart of master-slave
resources, we are trying the node restart option.
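
For illustration, the node restart option could be expressed by setting
on-fail="fence" on the Master monitor operation of each resource, as the
configuration quoted below already does for myapp3. This is only a sketch
under that assumption: it reuses the resource names, agents, intervals and
timeouts from that configuration, and it only works once a functioning
STONITH device is configured.

```
# Sketch only: fence the node when a Master monitor fails.
# Assumes a working STONITH device; values are copied from the quoted config.
primitive myapp1 ocf:heartbeat:Redundancy \
        op monitor interval="60s" role="Master" timeout="30s" on-fail="fence" \
        op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
primitive myapp2 ocf:mcg:Redundancy_myapp2 \
        op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
        op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
```

After the fenced node reboots, its resources come back as slaves only if the
cluster stack is started again on that node, for example at boot.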

Thanks and regards
Neha Chatrath

---------- Forwarded message ----------
From: Andreas Kurz <andreas at hastexo.com>
Date: Tue, Oct 18, 2011 at 1:55 PM
Subject: Re: [Pacemaker] Problem in Stonith configuration
To: pacemaker at oss.clusterlabs.org

Hello,

On 10/18/2011 09:00 AM, neha chatrath wrote:
> Hello,
>
> Minor updates in the first requirement.
> 1. If a resource fails, node should reboot (through fencing mechanism)
> and resources should re-start on the node.

Why would you want that? This would increase the service downtime
considerably. Why is a local restart not possible ... and even if there
is a good reason for a reboot, why not start the resource on the
other node?

> 2. If the physical link between the nodes in a cluster fails then that
> node should be isolated (kind of a power down) and the resources should
> continue to run on the other nodes

That is how stonith works, yes.

crm ra list stonith ... gives you a list of all available stonith plugins.

crm ra info stonith:xxxx ... details for a specific plugin.

Using external/ipmi is often a good choice because a lot of servers
already have a BMC with IPMI on board or they are shipped with a
management card supporting IPMI.
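
As a rough sketch of what an external/ipmi setup might look like for the two
nodes in this thread (mcg1 and mcg2): the BMC addresses, user name, password
and resource names below are placeholders, not values from this cluster, and
the parameters your plugin version accepts should be checked with
crm ra info first.

```
# Check which parameters the plugin expects
crm ra info stonith:external/ipmi

# Hypothetical values: replace ipaddr/userid/passwd with the real BMC settings
crm configure primitive st_mcg1 stonith:external/ipmi \
        params hostname="mcg1" ipaddr="192.168.1.211" userid="admin" passwd="secret" interface="lan" \
        op monitor interval="60s"
crm configure primitive st_mcg2 stonith:external/ipmi \
        params hostname="mcg2" ipaddr="192.168.1.212" userid="admin" passwd="secret" interface="lan" \
        op monitor interval="60s"

# Keep each fencing device off the node it is meant to fence
crm configure location l_st_mcg1 st_mcg1 -inf: mcg1
crm configure location l_st_mcg2 st_mcg2 -inf: mcg2
```

One device per node, each banned from its own target, means the surviving
node is always the one that runs the device able to power-cycle its peer.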

Regards,
Andreas

On Tue, Oct 18, 2011 at 12:30 PM, neha chatrath <nehachatrath at gmail.com> wrote:

> Hello,
>
> Minor updates in the first requirement.
> 1. If a resource fails, node should reboot (through fencing mechanism) and
> resources should re-start on the node.
>
> 2. If the physical link between the nodes in a cluster fails then that node
> should be isolated (kind of a power down) and the resources should continue
> to run on the other nodes
>
> Apologies for the inconvenience.
>
>
> Thanks and regards
> Neha Chatrath
>
> On Tue, Oct 18, 2011 at 12:08 PM, neha chatrath <nehachatrath at gmail.com> wrote:
>
>> Hello Andreas,
>>
>> Thanks for the reply.
>>
>> So can you please suggest which STONITH plugin I should use for the
>> production release of my software? I have the following system requirements:
>> 1. If a node in the cluster fails, it should be rebooted and resources
>> should re-start on the node.
>> 2. If the physical link between the nodes in a cluster fails then that
>> node should be isolated (kind of a power down) and the resources should
>> continue to run on the other nodes.
>>
>> I have different types of resources, e.g. primitive, master-slave and clone,
>> running on my system.
>>
>> Thanks and regards
>> Neha Chatrath
>>
>>
>> Date: Mon, 17 Oct 2011 15:08:16 +0200
>> From: Andreas Kurz <andreas at hastexo.com>
>> To: pacemaker at oss.clusterlabs.org
>> Subject: Re: [Pacemaker] Problem in Stonith configuration
>> Message-ID: <4E9C28C0.8070904 at hastexo.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hello,
>>
>>
>> On 10/17/2011 12:34 PM, neha chatrath wrote:
>> > Hello,
>> > I am configuring a 2 node cluster with following configuration:
>> >
>> > *[root at MCG1 init.d]# crm configure show
>> >
>> > node $id="16738ea4-adae-483f-9d79-b0ecce8050f4" mcg2 \
>> > attributes standby="off"
>> >
>> > node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
>> > attributes standby="off"
>> >
>> > primitive ClusterIP ocf:heartbeat:IPaddr \
>> > params ip="192.168.1.204" cidr_netmask="255.255.255.0" nic="eth0:1" \
>> > op monitor interval="40s" timeout="20s" \
>> > meta target-role="Started"
>> >
>> > primitive app1_fencing stonith:suicide \
>> > op monitor interval="90" \
>> > meta target-role="Started"
>> >
>> > primitive myapp1 ocf:heartbeat:Redundancy \
>> > op monitor interval="60s" role="Master" timeout="30s" on-fail="standby" \
>> > op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
>> >
>> > primitive myapp2 ocf:mcg:Redundancy_myapp2 \
>> > op monitor interval="60" role="Master" timeout="30" on-fail="standby" \
>> > op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>> >
>> > primitive myapp3 ocf:mcg:red_app3 \
>> > op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
>> > op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>> >
>> > ms ms_myapp1 myapp1 \
>> > meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> >
>> > ms ms_myapp2 myapp2 \
>> > meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> >
>> > ms ms_myapp3 myapp3 \
>> > meta master-max="1" master-max-node="1" clone-max="2" clone-node-max="1" notify="true"
>> >
>> > colocation myapp1_col inf: ClusterIP ms_myapp1:Master
>> >
>> > colocation myapp2_col inf: ClusterIP ms_myapp2:Master
>> >
>> > colocation myapp3_col inf: ClusterIP ms_myapp3:Master
>> >
>> > order myapp1_order inf: ms_myapp1:promote ClusterIP:start
>> >
>> > order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
>> >
>> > order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
>> >
>> > property $id="cib-bootstrap-options" \
>> > dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
>> > cluster-infrastructure="Heartbeat" \
>> > stonith-enabled="true" \
>> > no-quorum-policy="ignore"
>> >
>> > rsc_defaults $id="rsc-options" \
>> > resource-stickiness="100" \
>> > migration-threshold="3"
>> > *
>>
>> > I start the Heartbeat daemon on only one of the nodes, e.g. mcg1. But none
>> > of the resources (myapp, myapp1 etc.) gets started even on this node.
>> > Following is the output of the "crm_mon -f" command:
>> >
>> > *Last updated: Mon Oct 17 10:19:22 2011
>>
>> > Stack: Heartbeat
>> > Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd)- partition with quorum
>> > Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
>> > 2 Nodes configured, unknown expected votes
>> > 5 Resources configured.
>> > ============
>> > Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
>>
>> The cluster is waiting for a successful fencing event before starting
>> all resources .. the only way to be sure the second node runs no
>> resources.
>>
>> Since you are using the suicide plugin, this will never happen if Heartbeat
>> is not started on that node. If this is only a _test_setup_, go with the ssh
>> or even the null stonith plugin ... never use them on production systems!
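>>
>> A minimal sketch of such a test-only setup, assuming the external/ssh
>> plugin is installed and the nodes can reach each other over passwordless
>> ssh as root (the resource names st_ssh and fencing_clone are made up for
>> this example):
>>
>> ```
>> # TEST SETUPS ONLY: ssh-based fencing cannot reach a node that is hung
>> # or cut off from the network, so it gives no real protection.
>> primitive st_ssh stonith:external/ssh \
>>         params hostlist="mcg1 mcg2"
>> clone fencing_clone st_ssh
>> ```
>>
>> Cloning the device lets either node fence the other, which the suicide
>> plugin cannot do because it only acts on the node it runs on.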
>>
>> Regards,
>> Andreas
>>
>>
>> On Mon, Oct 17, 2011 at 4:04 PM, neha chatrath <nehachatrath at gmail.com> wrote:
>>
>>> Hello,
>>> I am configuring a 2 node cluster with following configuration:
>>>
>>> *[root at MCG1 init.d]# crm configure show
>>>
>>> node $id="16738ea4-adae-483f-9d79-b0ecce8050f4" mcg2 \
>>> attributes standby="off"
>>>
>>> node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
>>> attributes standby="off"
>>>
>>> primitive ClusterIP ocf:heartbeat:IPaddr \
>>> params ip="192.168.1.204" cidr_netmask="255.255.255.0" nic="eth0:1" \
>>> op monitor interval="40s" timeout="20s" \
>>> meta target-role="Started"
>>>
>>> primitive app1_fencing stonith:suicide \
>>> op monitor interval="90" \
>>> meta target-role="Started"
>>>
>>> primitive myapp1 ocf:heartbeat:Redundancy \
>>> op monitor interval="60s" role="Master" timeout="30s" on-fail="standby" \
>>> op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
>>>
>>> primitive myapp2 ocf:mcg:Redundancy_myapp2 \
>>> op monitor interval="60" role="Master" timeout="30" on-fail="standby" \
>>> op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>>>
>>> primitive myapp3 ocf:mcg:red_app3 \
>>> op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
>>> op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>>>
>>> ms ms_myapp1 myapp1 \
>>> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>>
>>> ms ms_myapp2 myapp2 \
>>> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>>
>>> ms ms_myapp3 myapp3 \
>>> meta master-max="1" master-max-node="1" clone-max="2" clone-node-max="1" notify="true"
>>>
>>> colocation myapp1_col inf: ClusterIP ms_myapp1:Master
>>>
>>> colocation myapp2_col inf: ClusterIP ms_myapp2:Master
>>>
>>> colocation myapp3_col inf: ClusterIP ms_myapp3:Master
>>>
>>> order myapp1_order inf: ms_myapp1:promote ClusterIP:start
>>>
>>> order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
>>>
>>> order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
>>>
>>> property $id="cib-bootstrap-options" \
>>> dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
>>> cluster-infrastructure="Heartbeat" \
>>> stonith-enabled="true" \
>>> no-quorum-policy="ignore"
>>>
>>> rsc_defaults $id="rsc-options" \
>>> resource-stickiness="100" \
>>> migration-threshold="3"
>>> *
>>> I start the Heartbeat daemon on only one of the nodes, e.g. mcg1. But none
>>> of the resources (myapp, myapp1 etc.) gets started even on this node.
>>> Following is the output of the "crm_mon -f" command:
>>>
>>> *Last updated: Mon Oct 17 10:19:22 2011
>>> Stack: Heartbeat
>>> Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd)- partition with quorum
>>> Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
>>> 2 Nodes configured, unknown expected votes
>>> 5 Resources configured.
>>> ============
>>> Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
>>> Online: [ mcg1 ]
>>> app1_fencing    (stonith:suicide):Started mcg1
>>>
>>> Migration summary:
>>> * Node mcg1:
>>> *
>>> When I set "stonith-enabled" to false, all my resources come up.
>>>
>>> Can somebody help me with STONITH configuration?
>>>
>>> Cheers
>>> Neha Chatrath
>>>                           KEEP SMILING!!!!
>>>
>>
>>
>