[Pacemaker] Failed stop of stonith resource

Vladislav Bogdanov bubble at hoster-ok.com
Tue Aug 13 17:51:32 EDT 2013


Hi,

I just caught an unexpected fencing of a node because (as far as I can tell from a very quick analysis, so I may be wrong) a stonith resource (fence_ipmilan) failed to start on it and then also failed to stop.

Excerpt from logs:

Aug 13 20:57:56 v03-a pengine[2329]:   notice: stage6: Scheduling Node v03-a for shutdown
Aug 13 20:57:56 v03-a pengine[2329]:   notice: LogActions: Move stonith-ipmi-v03-b#011(Started v03-a -> mgmt01)
Aug 13 20:57:56 v03-a crmd[2330]:   notice: te_rsc_command: Initiating action 127: stop stonith-ipmi-v03-b_stop_0 on v03-a (local)
Aug 13 20:57:56 v03-a crmd[2330]:   notice: process_lrm_event: LRM operation stonith-ipmi-v03-b_stop_0 (call=992, rc=0, cib-update=415, confirmed=true)
Aug 13 20:57:56 v03-a crmd[2330]:   notice: te_rsc_command: Initiating action 128: start stonith-ipmi-v03-b_start_0 on mgmt01
Aug 13 20:58:58 v03-a crmd[2330]:  warning: status_from_rc: Action 128 (stonith-ipmi-v03-b_start_0) on mgmt01 failed (target: 0 vs. rc: 1): Error
Aug 13 20:58:58 v03-a crmd[2330]:  warning: update_failcount: Updating failcount for stonith-ipmi-v03-b on mgmt01 after failed start: rc=1 (update=INFINITY, time=1376427538)
Aug 13 20:58:58 v03-a crmd[2330]:  warning: update_failcount: Updating failcount for stonith-ipmi-v03-b on mgmt01 after failed start: rc=1 (update=INFINITY, time=1376427538)
Aug 13 20:58:58 v03-a pengine[2329]:  warning: unpack_rsc_op: Processing failed op start for stonith-ipmi-v03-b on mgmt01: unknown error (1)
Aug 13 20:58:58 v03-a pengine[2329]:   notice: LogActions: Recover stonith-ipmi-v03-b#011(Started mgmt01)
Aug 13 20:58:59 v03-a crmd[2330]:   notice: te_rsc_command: Initiating action 1: stop stonith-ipmi-v03-b_stop_0 on mgmt01
Aug 13 20:59:01 v03-a crmd[2330]:  warning: status_from_rc: Action 1 (stonith-ipmi-v03-b_stop_0) on mgmt01 failed (target: 0 vs. rc: 1): Error
Aug 13 20:59:01 v03-a crmd[2330]:  warning: update_failcount: Updating failcount for stonith-ipmi-v03-b on mgmt01 after failed stop: rc=1 (update=INFINITY, time=1376427541)
Aug 13 20:59:01 v03-a crmd[2330]:  warning: update_failcount: Updating failcount for stonith-ipmi-v03-b on mgmt01 after failed stop: rc=1 (update=INFINITY, time=1376427541)
Aug 13 20:59:12 v03-a pengine[2329]:  warning: unpack_rsc_op: Processing failed op stop for stonith-ipmi-v03-b on mgmt01: unknown error (1)
Aug 13 20:59:12 v03-a pengine[2329]:  warning: pe_fence_node: Node mgmt01 will be fenced because of resource failure(s)
Aug 13 20:59:12 v03-a pengine[2329]:  warning: common_apply_stickiness: Forcing stonith-ipmi-v03-b away from mgmt01 after 1000000 failures (max=1000000)
Aug 13 20:59:12 v03-a pengine[2329]:  warning: stage6: Scheduling Node mgmt01 for STONITH

I would expect failures of stonith resources not to cause fencing. Am I wrong?
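
(For reference, here is a rough sketch of how I imagine the failed stop could be kept from escalating, in crm shell syntax with made-up addresses and credentials; my understanding is that on-fail=block on the stop operation makes the cluster freeze the resource instead of fencing the node, but I have not verified this:

primitive stonith-ipmi-v03-b stonith:fence_ipmilan \
        params ipaddr="192.168.1.2" login="admin" passwd="secret" \
        op start interval="0" timeout="60s" \
        op monitor interval="60s" timeout="30s" \
        op stop interval="0" timeout="60s" on-fail="block"

I would rather not have to do that for every stonith resource, though, hence the question above.)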

mgmt01 is running a merge of the latest ClusterLabs and beekhof trees
(ClusterLabs/pacemaker/master 98aca50 + beekhof/pacemaker/master
86b339c); v03-a was running 2518fd0 when this happened (I was rebooting
it in order to upgrade it to the version above).

Of course, the reason for the fence_ipmilan failure requires investigation
too, but I do not think it is important for the issue above.
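
(When I do get to it, I will probably just start by running the agent by hand, something along these lines, with placeholder address and credentials:

fence_ipmilan -a 192.168.1.2 -l admin -p secret -o status -v

and compare that with the parameters configured for the resource.)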

Vladislav
