[Pacemaker] R: R: Stonith external/sbd problem
Michael Brown
michael at netdirect.ca
Thu Apr 29 14:32:39 UTC 2010
FWIW, here's my setup for sbd on shared storage:
in /etc/init.d/boot.local:
sbd -d
/dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428 -D
-W watch
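The SBD_Fencing guide keeps the device and watchdog options in
/etc/sysconfig/sbd rather than boot.local (as the quoted messages further
down show); a minimal sketch of that file with my device path substituted:

# /etc/sysconfig/sbd -- sketch only, substitute your own device path
SBD_DEVICE="/dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428"
SBD_OPTS="-W"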
xenhost1:~ # sbd -d /dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428 list
0 xenhost1 clear
1 xenhost2 clear
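That slot table only shows up after the device has been initialized once;
a rough sketch of that one-time setup, using the create/allocate/list
commands that come up later in this thread (my device path and hostnames,
substitute your own):

# initialize the shared SBD partition (run once, from one node)
sbd -d /dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428 create
# allocate a slot for each cluster node
sbd -d /dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428 allocate xenhost1
sbd -d /dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428 allocate xenhost2
# verify: should print one 'clear' slot per node, as shown above
sbd -d /dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428 list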
excerpt from 'crm configure show':
primitive sbd stonith:external/sbd \
operations $id="sbd-operations" \
op monitor interval="15" timeout="15" start-delay="15" \
params sbd_device="/dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428"
clone sbd-clone sbd \
meta interleave="true"
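For reference, a rough crm-shell equivalent of the configuration above, in
the same style as Vit's one-liner below (the resource names and the
15-second monitor values are simply the ones I happen to use):

crm configure primitive sbd stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/dm-uuid-part2-mpath-3600a0b8000266f7e000035414bd00428" \
    op monitor interval="15" timeout="15" start-delay="15"
crm configure clone sbd-clone sbd meta interleave="true"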
What do you see if you run '/usr/sbin/sbd -d /dev/mapper/mpath1p1 list'?
M.
On 04/29/2010 10:23 AM, Nicola Sabatelli wrote:
>
> Yes, I created the disk and allocated the node, and I created a resource
> on the cluster in this way:
>
> <clone id="cl_external_sbd_1">
>
> <meta_attributes id="cl_external_sbd_1-meta_attributes">
>
> <nvpair id="cl_external_sbd_1-meta_attributes-clone-max"
> name="clone-max" value="2"/>
>
> </meta_attributes>
>
> <primitive class="stonith" type="external/sbd"
> id="stonith_external_sbd_LOCK_LUN">
>
> <instance_attributes
> id="stonith_external_sbd_LOCK_LUN-instance_attributes">
>
> <nvpair
> id="nvpair-stonith_external_sbd_LOCK_LUN-sbd_device" name="sbd_device"
> value="/dev/mapper/mpath1p1"/>
>
> </instance_attributes>
>
> <operations id="stonith_external_sbd_LOCK_LUN-operations">
>
> <op id="op-stonith_external_sbd_LOCK_LUN-stop"
> interval="0" name="stop" timeout="60"/>
>
> <op id="op-stonith_external_sbd_LOCK_LUN-monitor"
> interval="60" name="monitor" start-delay="0" timeout="60"/>
>
> <op id="op-stonith_external_sbd_LOCK_LUN-start"
> interval="0" name="start" timeout="60"/>
>
> </operations>
>
> <meta_attributes
> id="stonith_external_sbd_LOCK_LUN-meta_attributes">
>
> <nvpair name="target-role"
> id="stonith_external_sbd_LOCK_LUN-meta_attributes-target-role"
> value="stopped"/>
>
> </meta_attributes>
>
> </primitive>
>
> </clone>
>
>
>
>
>
> Ciao, Nicola.
>
> ------------------------------------------------------------------------
>
> *From:* Vit Pelcak [mailto:vpelcak at suse.cz]
> *Sent:* Thursday, 29 April 2010, 16:08
> *To:* pacemaker at oss.clusterlabs.org
> *Subject:* Re: [Pacemaker] R: Stonith external/sbd problem
>
>
>
> Also, the stonith resource needs to be added to the CIB:
>
> crm configure primitive sbd_stonith stonith:external/sbd meta
> target-role="Started" op monitor interval="15" timeout="15"
> start-delay="15" params sbd_device="/dev/sda1"
>
>
> On 29.4.2010 15:46, Nicola Sabatelli wrote:
>
> I have followed exactly the configuration in the SBD_Fencing documentation.
>
> That is:
>
> /etc/sysconfig/sbd
>
> SBD_DEVICE="/dev/mapper/mpath1p1"
>
> SBD_OPTS="-W"
>
> And I start the daemon in this manner:
>
> /usr/sbin/sbd -d /dev/mapper/mpath1p1 -D -W watch
>
> Is this correct?
>
>
>
> Ciao, Nicola.
>
> ------------------------------------------------------------------------
>
> *From:* Vit Pelcak [mailto:vpelcak at suse.cz]
> *Sent:* Thursday, 29 April 2010, 15:02
> *To:* pacemaker at oss.clusterlabs.org
> *Subject:* Re: [Pacemaker] Stonith external/sbd problem
>
>
>
> cat /etc/sysconfig/sbd
>
> SBD_DEVICE="/dev/sda1"
> SBD_OPTS="-W"
>
>
> sbd -d /dev/shared_disk create
> sbd -d /dev/shared_disk allocate your_machine
>
>
> On 29.4.2010 14:55, Michael Brown wrote:
>
> Oh, I forgot a piece: I had similar trouble until I actually properly
> started sbd, and then it worked.
>
> M.
>
> ------------------------------------------------------------------------
>
> *From*: Michael Brown
> *To*: pacemaker at oss.clusterlabs.org
> *Sent*: Thu Apr 29 08:53:32 2010
> *Subject*: Re: [Pacemaker] Stonith external/sbd problem
>
>
> I just set this up myself and it worked fine for me.
>
> Did you follow the guide? You need to configure the sbd daemon to run
> on bootup with appropriate options before external/sbd can use it.
>
> M.
>
> ------------------------------------------------------------------------
>
> *From*: Nicola Sabatelli
> *To*: pacemaker at oss.clusterlabs.org
> *Sent*: Thu Apr 29 08:47:04 2010
> *Subject*: [Pacemaker] Stonith external/sbd problem
>
> I have a problem with the STONITH plugin external/sbd.
>
> I have configured the system according to the directions that I found at
> the URL http://www.linux-ha.org/wiki/SBD_Fencing, and the device that I
> use is configured with the multipath software, because this disk resides
> on a storage system.
>
> I have created a resource on my cluster using the clone directive.
>
> But when I try to start the resource I get these errors:
>
>
>
> from ha-log file:
>
>
>
> Apr 29 14:37:51 clover-h stonithd: [16811]: info: external_run_cmd:
> Calling '/usr/lib64/stonith/plugins/external/sbd status' returned 256
>
> Apr 29 14:37:51 clover-h stonithd: [16811]: CRIT: external_status:
> 'sbd status' failed with rc 256
>
> Apr 29 14:37:51 clover-h stonithd: [10615]: WARN: start
> stonith_external_sbd_LOCK_LUN:0 failed, because its hostlist is empty
>
>
>
> from crm_verify:
>
>
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: main: =#=#=#=#= Getting
> XML =#=#=#=#=
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: main: Reading XML from:
> live cluster
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: unpack_config: On loss
> of CCM Quorum: Ignore
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: unpack_config: Node
> scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: determine_online_status:
> Node clover-a.rsr.rupar.puglia.it is online
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: unpack_rsc_op: Processing
> failed op stonith_external_sbd_LOCK_LUN:1_start_0 on
> clover-a.rsr.rupar.puglia.it: unknown error (1)
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: find_clone: Internally
> renamed stonith_external_sbd_LOCK_LUN:0 on
> clover-a.rsr.rupar.puglia.it to stonith_external_sbd_LOCK_LUN:2 (ORPHAN)
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: determine_online_status:
> Node clover-h.rsr.rupar.puglia.it is online
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: unpack_rsc_op: Processing
> failed op stonith_external_sbd_LOCK_LUN:0_start_0 on
> clover-h.rsr.rupar.puglia.it: unknown error (1)
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: clone_print:
> Master/Slave Set: ms_drbd_1
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: short_print:
> Stopped: [ res_drbd_1:0 res_drbd_1:1 ]
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: native_print:
> res_Filesystem_TEST (ocf::heartbeat:Filesystem): Stopped
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: native_print:
> res_IPaddr2_ip_clover (ocf::heartbeat:IPaddr2): Stopped
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: clone_print: Clone
> Set: cl_external_sbd_1
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: native_print:
> stonith_external_sbd_LOCK_LUN:0 (stonith:external/sbd): Started
> clover-h.rsr.rupar.puglia.it FAILED
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: native_print:
> stonith_external_sbd_LOCK_LUN:1 (stonith:external/sbd): Started
> clover-a.rsr.rupar.puglia.it FAILED
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: get_failcount:
> cl_external_sbd_1 has failed 1000000 times on clover-h.rsr.rupar.puglia.it
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: common_apply_stickiness:
> Forcing cl_external_sbd_1 away from clover-h.rsr.rupar.puglia.it after
> 1000000 failures (max=1000000)
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: get_failcount:
> cl_external_sbd_1 has failed 1000000 times on clover-a.rsr.rupar.puglia.it
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: common_apply_stickiness:
> Forcing cl_external_sbd_1 away from clover-a.rsr.rupar.puglia.it after
> 1000000 failures (max=1000000)
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: native_merge_weights:
> ms_drbd_1: Rolling back scores from res_Filesystem_TEST
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: native_color: Resource
> res_drbd_1:0 cannot run anywhere
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: native_color: Resource
> res_drbd_1:1 cannot run anywhere
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: native_merge_weights:
> ms_drbd_1: Rolling back scores from res_Filesystem_TEST
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: master_color: ms_drbd_1:
> Promoted 0 instances of a possible 1 to master
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: master_color: ms_drbd_1:
> Promoted 0 instances of a possible 1 to master
>
> crm_verify[18607]: 2010/04/29_14:39:27 info: native_merge_weights:
> res_Filesystem_TEST: Rolling back scores from res_IPaddr2_ip_clover
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: native_color: Resource
> res_Filesystem_TEST cannot run anywhere
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: native_color: Resource
> res_IPaddr2_ip_clover cannot run anywhere
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: native_color: Resource
> stonith_external_sbd_LOCK_LUN:0 cannot run anywhere
>
> crm_verify[18607]: 2010/04/29_14:39:27 WARN: native_color: Resource
> stonith_external_sbd_LOCK_LUN:1 cannot run anywhere
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: LogActions: Leave
> resource res_drbd_1:0 (Stopped)
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: LogActions: Leave
> resource res_drbd_1:1 (Stopped)
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: LogActions: Leave
> resource res_Filesystem_TEST (Stopped)
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: LogActions: Leave
> resource res_IPaddr2_ip_clover (Stopped)
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: LogActions: Stop
> resource stonith_external_sbd_LOCK_LUN:0
> (clover-h.rsr.rupar.puglia.it)
>
> crm_verify[18607]: 2010/04/29_14:39:27 notice: LogActions: Stop
> resource stonith_external_sbd_LOCK_LUN:1
> (clover-a.rsr.rupar.puglia.it)
>
> Warnings found during check: config may not be valid
>
>
>
> and from crm_mon:
>
>
>
> ============
>
> Last updated: Thu Apr 29 14:39:57 2010
>
> Stack: Heartbeat
>
> Current DC: clover-h.rsr.rupar.puglia.it
> (e39bb201-2a6f-457a-a308-be6bfe71309c) - partition with quorum
>
> Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
>
> 2 Nodes configured, unknown expected votes
>
> 4 Resources configured.
>
> ============
>
>
>
> Online: [ clover-h.rsr.rupar.puglia.it clover-a.rsr.rupar.puglia.it ]
>
>
>
> Clone Set: cl_external_sbd_1
>
> stonith_external_sbd_LOCK_LUN:0 (stonith:external/sbd):
> Started clover-h.rsr.rupar.puglia.it FAILED
>
> stonith_external_sbd_LOCK_LUN:1 (stonith:external/sbd):
> Started clover-a.rsr.rupar.puglia.it FAILED
>
>
>
> Operations:
>
> * Node clover-a.rsr.rupar.puglia.it:
>
> stonith_external_sbd_LOCK_LUN:1: migration-threshold=1000000
> fail-count=1000000
>
> + (24) start: rc=1 (unknown error)
>
> * Node clover-h.rsr.rupar.puglia.it:
>
> stonith_external_sbd_LOCK_LUN:0: migration-threshold=1000000
> fail-count=1000000
>
> + (25) start: rc=1 (unknown error)
>
>
>
> Failed actions:
>
> stonith_external_sbd_LOCK_LUN:1_start_0
> (node=clover-a.rsr.rupar.puglia.it, call=24, rc=1, status=complete):
> unknown error
>
> stonith_external_sbd_LOCK_LUN:0_start_0
> (node=clover-h.rsr.rupar.puglia.it, call=25, rc=1, status=complete):
> unknown error
>
>
>
>
>
>
>
>
>
> Ciao, Nicola.
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
--
Michael Brown | `One of the main causes of the fall of
Systems Consultant | the Roman Empire was that, lacking zero,
Net Direct Inc. | they had no way to indicate successful
Tel: +1 519 883 1172 x5106 | termination of their C programs.' - Firth