[Pacemaker] fencing question

Wed Mar 12 14:17:13 UTC 2014

Hi,

We have a two-node HA cluster running SUSE SLES 11 SP3 with the HA
Extension, at the latest release level.
A Xen resource was stopped manually. Its shutdown_timeout is 120s,
but after 60s the stop timed out and the node was fenced and shut down
by the other node.

Should I change one of the timeout values?

This is part of our configuration:
...
primitive fkflmw ocf:heartbeat:Xen \
         meta target-role="Started" is-managed="true" allow-migrate="true" \
         op monitor interval="10" timeout="30" \
         op migrate_from interval="0" timeout="600" \
         op migrate_to interval="0" timeout="600" \
         params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"
...
...
property $id="cib-bootstrap-options" \
         dc-version="1.1.10-f3eeaf4" \
         cluster-infrastructure="classic openais (with plugin)" \
         expected-quorum-votes="2" \
         no-quorum-policy="ignore" \
         last-lrm-refresh="1394533475" \
         default-action-timeout="60s"
rsc_defaults $id="rsc_defaults-options" \
         resource-stickiness="10" \
         migration-threshold="3"

We had this scenario:

On node ha2infra:

Mar 12 11:59:59 ha2infra pengine[25631]:   notice: LogActions: Stop fkflmw   (ha2infra)   <--------------- Resource fkflmw was stopped manually
Mar 12 11:59:59 ha2infra pengine[25631]:   notice: process_pe_message: Calculated Transition 105: /var/lib/pacemaker/pengine/pe-input-519.bz2
Mar 12 11:59:59 ha2infra crmd[25632]:   notice: do_te_invoke: Processing graph 105 (ref=pe_calc-dc-1394621999-178) derived from /var/lib/pacemaker/pengine/pe-input-519.bz2
Mar 12 11:59:59 ha2infra crmd[25632]:   notice: te_rsc_command: Initiating action 60: stop fkflmw_stop_0 on ha2infra (local)
Mar 12 11:59:59 ha2infra Xen(fkflmw)[22718]: INFO: Xen domain fkflmw will be stopped (timeout: 120s)   <--------------- stopping fkflmw
Mar 12 12:00:00 ha2infra mgmtd: [25633]: info: CIB query: cib
Mar 12 12:00:00 ha2infra mgmtd: [25633]: info: CIB query: cib
Mar 12 12:00:59 ha2infra sshd[24992]: Connection closed by 134.105.232.21 [preauth]
Mar 12 12:00:59 ha2infra lrmd[25629]:  warning: child_timeout_callback: fkflmw_stop_0 process (PID 22718) timed out
Mar 12 12:00:59 ha2infra lrmd[25629]:  warning: operation_finished: fkflmw_stop_0:22718 - timed out after 60000ms   <--------------- Stop timed out after 60s (not 120s)
Mar 12 12:00:59 ha2infra crmd[25632]:    error: process_lrm_event: LRM operation fkflmw_stop_0 (136) Timed Out (timeout=60000ms)
Mar 12 12:00:59 ha2infra crmd[25632]:  warning: status_from_rc: Action 60 (fkflmw_stop_0) on ha2infra failed (target: 0 vs. rc: 1): Error

Mar 12 12:00:59 ha2infra pengine[25631]:  warning: unpack_rsc_op_failure: Processing failed op stop for fkflmw on ha2infra: unknown error (1)
Mar 12 12:00:59 ha2infra pengine[25631]:  warning: pe_fence_node: Node ha2infra will be fenced because of resource failure(s)   <--------------- is this normal?
Mar 12 12:00:59 ha2infra pengine[25631]:  warning: stage6: Scheduling Node ha2infra for STONITH

On node ha1infra:

Mar 12 12:00:59 ha1infra stonith-ng[21808]:   notice: can_fence_host_with_device: stonith_1 can fence ha2infra: dynamic-list
Mar 12 12:01:01 ha1infra stonith-ng[21808]:   notice: log_operation: Operation 'reboot' [23984] (call 2 from crmd.25632) for host 'ha2infra' with device 'stonith_1' returned: 0 (OK)
Mar 12 12:01:05 ha1infra corosync[21794]:  [TOTEM ] A processor failed, forming new configuration.
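
One possible reading of the logs above, offered as a sketch and not a
confirmed diagnosis: the primitive defines no explicit stop operation, so
the stop presumably falls back to default-action-timeout="60s", and lrmd
ends the stop after 60s regardless of the agent's shutdown_timeout="120".
If that assumption holds, giving the stop operation its own timeout,
longer than shutdown_timeout, should avoid the failed stop. The value 180
below is an illustrative assumption, not a tested recommendation; the
other lines are copied unchanged from the configuration above.

```
# Assumption: without an explicit "op stop", the stop action inherits
# default-action-timeout (60s). The stop timeout of 180 is hypothetical;
# it only needs to exceed shutdown_timeout (120) with some margin.
primitive fkflmw ocf:heartbeat:Xen \
         meta target-role="Started" is-managed="true" allow-migrate="true" \
         op monitor interval="10" timeout="30" \
         op stop interval="0" timeout="180" \
         op migrate_from interval="0" timeout="600" \
         op migrate_to interval="0" timeout="600" \
         params xmfile="/etc/xen/vm/fkflmw" shutdown_timeout="120"
```

As far as we understand, fencing after a failed stop is the expected
behaviour when STONITH is enabled, because the cluster cannot otherwise
be sure the resource is really down.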

Karl Roessmann
-- 
Karl Rößmann				Tel. +49-711-689-1657
Max-Planck-Institut FKF       		Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart				email K.Roessmann at fkf.mpg.de