[Pacemaker] Problem in Stonith configuration
Andreas Kurz
andreas at hastexo.com
Tue Oct 18 15:26:41 UTC 2011
On 10/18/2011 11:21 AM, neha chatrath wrote:
> Hello,
>> 1. If a resource fails, the node should reboot (through the fencing mechanism)
>> and resources should restart on the node.
>
> Why would you want that? This would increase the service downtime
> considerably. Why is a local restart not possible ... and even if there
> is a good reason for a reboot, why not start the resource on the
> other node?
> - In our system, there are some primitive and clone resources, along with 3
> different master-slave resources.
> - All the masters and slaves of these resources are co-located, i.e. all
> 3 masters are co-located on one node and the 3 slaves on the other node.
> - These 3 master-slave resources are tightly coupled. There is a
> requirement that failure of even one of these resources restarts
> all the resources in the group.
> - All these resources can be shifted to the other node, but subsequently
> they should also be restarted, as a lot of data/control plane syncing
> is being done between the two nodes.
> e.g. If one of the resources running on node1 as a Master fails, then
> all these 3 resources are shifted to the other node, i.e. node2 (with the
> corresponding slave resources being promoted to master). On node1,
> these resources should get restarted as slaves.
>
> We understand that a node restart will increase the downtime, but since we
> could not find much on the option for a group restart of master-slave
> resources, we are trying the node restart option.
Hmm ... then setting on-fail=fence on the monitor operations should give
you what you want.

Consequently, you should define this for all MS resources, because once a
node is in standby (on-fail=standby) it has to be put back online manually
again ... or keep the default for the "less important" MS resources.
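
For example, a minimal sketch based on the myapp1 primitive from your own
configuration, with the monitor operations changed to fence on failure
(intervals and timeouts kept as you have them):

primitive myapp1 ocf:heartbeat:Redundancy \
        op monitor interval="60s" role="Master" timeout="30s" on-fail="fence" \
        op monitor interval="40s" role="Slave" timeout="40s" on-fail="fence"

Whether a Slave monitor failure should also fence the node, or only restart
the resource locally, is your decision.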
Regards,
Andreas
--
Need help with Pacemaker?
http://www.hastexo.com/now
>
> Thanks and regards
> Neha Chatrath
>
> ---------- Forwarded message ----------
> From: *Andreas Kurz* <andreas at hastexo.com>
> Date: Tue, Oct 18, 2011 at 1:55 PM
> Subject: Re: [Pacemaker] Problem in Stonith configuration
> To: pacemaker at oss.clusterlabs.org
>
>
> Hello,
>
> On 10/18/2011 09:00 AM, neha chatrath wrote:
>> Hello,
>>
>> Minor updates in the first requirement.
>> 1. If a resource fails, the node should reboot (through the fencing mechanism)
>> and resources should restart on the node.
>
> Why would you want that? This would increase the service downtime
> considerably. Why is a local restart not possible ... and even if there
> is a good reason for a reboot, why not start the resource on the
> other node?
>
>> 2. If the physical link between the nodes in a cluster fails, then that
>> node should be isolated (a kind of power down) and the resources should
>> continue to run on the other nodes.
>
> That is how stonith works, yes.
>
> crm ra list stonith ... gives you a list of all available stonith plugins.
>
> crm ra info stonith:xxxx ... details for a specific plugin.
>
> Using external/ipmi is often a good choice because a lot of servers
> already have a BMC with IPMI on board or they are shipped with a
> management card supporting IPMI.
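>
> A minimal sketch of such a fencing primitive, assuming external/ipmi and
> using placeholder addresses/credentials (check crm ra info
> stonith:external/ipmi for the exact parameters your version supports):
>
> primitive fence_mcg2 stonith:external/ipmi \
>         params hostname="mcg2" ipaddr="192.168.1.102" userid="admin" \
>                passwd="secret" interface="lan" \
>         op monitor interval="60s"
> location loc_fence_mcg2 fence_mcg2 -inf: mcg2
>
> The location constraint keeps the device that fences mcg2 from running on
> mcg2 itself; a second primitive would be configured the same way for mcg1.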
>
> Regards,
> Andreas
>
>
> On Tue, Oct 18, 2011 at 12:30 PM, neha chatrath <nehachatrath at gmail.com> wrote:
>
> Hello,
>
> Minor updates in the first requirement.
> 1. If a resource fails, the node should reboot (through the fencing
> mechanism) and resources should restart on the node.
>
> 2. If the physical link between the nodes in a cluster fails, then
> that node should be isolated (a kind of power down) and the
> resources should continue to run on the other nodes.
>
> Apologies for the inconvenience.
>
>
> Thanks and regards
> Neha Chatrath
>
> On Tue, Oct 18, 2011 at 12:08 PM, neha chatrath
> <nehachatrath at gmail.com> wrote:
>
> Hello Andreas,
>
> Thanks for the reply.
>
> So can you please suggest which Stonith plugin I should use for
> the production release of my software? I have the following
> system requirements:
> 1. If a node in the cluster fails, it should be rebooted and
> resources should restart on the node.
> 2. If the physical link between the nodes in a cluster fails,
> then that node should be isolated (a kind of power down) and the
> resources should continue to run on the other nodes.
>
> I have different types of resources, e.g. primitive, master-slave
> and clone, running on my system.
>
> Thanks and regards
> Neha Chatrath
>
>
> Date: Mon, 17 Oct 2011 15:08:16 +0200
> From: Andreas Kurz <andreas at hastexo.com>
> To: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] Problem in Stonith configuration
> Message-ID: <4E9C28C0.8070904 at hastexo.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hello,
>
>
> On 10/17/2011 12:34 PM, neha chatrath wrote:
> > Hello,
> > I am configuring a 2 node cluster with following configuration:
> >
> > *[root at MCG1 init.d]# crm configure show
> >
> > node $id="16738ea4-adae-483f-9d79-
> b0ecce8050f4" mcg2 \
> > attributes standby="off"
> >
> > node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
> > attributes standby="off"
> >
> > primitive ClusterIP ocf:heartbeat:IPaddr \
> > params ip="192.168.1.204" cidr_netmask="255.255.255.0"
> nic="eth0:1" \
> >
> > op monitor interval="40s" timeout="20s" \
> > meta target-role="Started"
> >
> > primitive app1_fencing stonith:suicide \
> > op monitor interval="90" \
> > meta target-role="Started"
> >
> > primitive myapp1 ocf:heartbeat:Redundancy \
> > op monitor interval="60s" role="Master" timeout="30s" on-fail="standby" \
> > op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
> >
> > primitive myapp2 ocf:mcg:Redundancy_myapp2 \
> > op monitor interval="60" role="Master" timeout="30" on-fail="standby" \
> > op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
> >
> > primitive myapp3 ocf:mcg:red_app3 \
> > op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
> > op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
> >
> > ms ms_myapp1 myapp1 \
> > meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> >
> > ms ms_myapp2 myapp2 \
> > meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> >
> > ms ms_myapp3 myapp3 \
> > meta master-max="1" master-max-node="1" clone-max="2" clone-node-max="1" notify="true"
> >
> > colocation myapp1_col inf: ClusterIP ms_myapp1:Master
> >
> > colocation myapp2_col inf: ClusterIP ms_myapp2:Master
> >
> > colocation myapp3_col inf: ClusterIP ms_myapp3:Master
> >
> > order myapp1_order inf: ms_myapp1:promote ClusterIP:start
> >
> > order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
> >
> > order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
> >
> > property $id="cib-bootstrap-options" \
> > dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
> > cluster-infrastructure="Heartbeat" \
> > stonith-enabled="true" \
> > no-quorum-policy="ignore"
> >
> > rsc_defaults $id="rsc-options" \
> > resource-stickiness="100" \
> > migration-threshold="3"
> > *
>
> > I start the Heartbeat daemon on only one of the nodes, e.g. mcg1. But none of the
> > resources (myapp, myapp1 etc.) gets started even on this node.
> > Following is the output of the "*crm_mon -f*" command:
> >
> > *Last updated: Mon Oct 17 10:19:22 2011
>
> > Stack: Heartbeat
> > Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd) - partition with quorum
> > Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
> > 2 Nodes configured, unknown expected votes
> > 5 Resources configured.
> > ============
> > Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
>
> The cluster is waiting for a successful fencing event before starting
> all resources ... that is the only way to be sure the second node runs no
> resources.
>
> Since you are using the suicide plugin, this will never happen if Heartbeat
> is not started on that node. If this is only a _test setup_, go with the ssh
> or even the null stonith plugin ... never use them on production
> systems!
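>
> Just as an illustration for such a test setup, a sketch of an ssh-based
> fencing resource (the hostlist value is an assumption based on your node
> names; see crm ra info stonith:external/ssh for details):
>
> primitive test_fencing stonith:external/ssh \
>         params hostlist="mcg1 mcg2" \
>         op monitor interval="60s"
> clone test_fencing_clone test_fencing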
>
> Regards,
> Andreas
>
>
> On Mon, Oct 17, 2011 at 4:04 PM, neha chatrath
> <nehachatrath at gmail.com> wrote:
>
> Hello,
> I am configuring a 2 node cluster with following configuration:
>
> *[root at MCG1 init.d]# crm configure show
>
> node $id="16738ea4-adae-483f-9d79-b0ecce8050f4" mcg2 \
> attributes standby="off"
>
> node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
> attributes standby="off"
>
> primitive ClusterIP ocf:heartbeat:IPaddr \
> params ip="192.168.1.204" cidr_netmask="255.255.255.0"
> nic="eth0:1" \
>
> op monitor interval="40s" timeout="20s" \
> meta target-role="Started"
>
> primitive app1_fencing stonith:suicide \
> op monitor interval="90" \
> meta target-role="Started"
>
> primitive myapp1 ocf:heartbeat:Redundancy \
> op monitor interval="60s" role="Master" timeout="30s" on-fail="standby" \
> op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
>
> primitive myapp2 ocf:mcg:Redundancy_myapp2 \
> op monitor interval="60" role="Master" timeout="30" on-fail="standby" \
> op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>
> primitive myapp3 ocf:mcg:red_app3 \
> op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
> op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>
> ms ms_myapp1 myapp1 \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>
> ms ms_myapp2 myapp2 \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>
> ms ms_myapp3 myapp3 \
> meta master-max="1" master-max-node="1" clone-max="2" clone-node-max="1" notify="true"
>
> colocation myapp1_col inf: ClusterIP ms_myapp1:Master
>
> colocation myapp2_col inf: ClusterIP ms_myapp2:Master
>
> colocation myapp3_col inf: ClusterIP ms_myapp3:Master
>
> order myapp1_order inf: ms_myapp1:promote ClusterIP:start
>
> order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
>
> order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
>
> property $id="cib-bootstrap-options" \
> dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="true" \
> no-quorum-policy="ignore"
>
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100" \
> migration-threshold="3"
> *
> I start the Heartbeat daemon on only one of the nodes, e.g. mcg1. But
> none of the resources (myapp, myapp1 etc.) gets started even
> on this node.
> Following is the output of the "*crm_mon -f*" command:
>
> *Last updated: Mon Oct 17 10:19:22 2011
> Stack: Heartbeat
> Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd) - partition with quorum
> Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
> 2 Nodes configured, unknown expected votes
> 5 Resources configured.
> ============
> Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
> Online: [ mcg1 ]
> app1_fencing (stonith:suicide): Started mcg1
>
> Migration summary:
> * Node mcg1:
> *
> When I set "stonith-enabled" to false, then all my resources
> come up.
>
> Can somebody help me with STONITH configuration?
>
> Cheers
> Neha Chatrath
> KEEP SMILING!!!!