[Pacemaker] resources do not start on surviving node after reboot

Thu Oct 31 03:20:50 UTC 2013

On 30 Oct 2013, at 1:12 am, Саша Александров <shurrman at gmail.com> wrote:

> Hi!
> 
> I have a 2-node cluster with shared storage and SBD-fencing.
> One node was down for maintenance.
> For external reasons, the second node was rebooted. After the reboot the service never came up:
> 
> Oct 29 13:04:21 wcs2 pengine[2362]:  warning: stage6: Scheduling Node wcs1 for STONITH
> Oct 29 13:04:21 wcs2 crmd[2363]:   notice: te_fence_node: Executing reboot fencing operation (53) on wcs1 (timeout=60000)
> Oct 29 13:05:33 wcs2 stonith-ng[2359]:    error: remote_op_done: Operation reboot of wcs1 by wcs2 for crmd.2363 at wcs2.4a3b045d: Timer expired
> Oct 29 13:05:33 wcs2 crmd[2363]:   notice: tengine_stonith_callback: Stonith operation 2/53:0:0:f56c4538-1ad8-4871-825e-167eb9304677: Timer expired (-62)
> Oct 29 13:05:33 wcs2 crmd[2363]:   notice: tengine_stonith_callback: Stonith operation 2 for wcs1 failed (Timer expired): aborting transition.
> Oct 29 13:05:33 wcs2 crmd[2363]:   notice: tengine_stonith_notify: Peer wcs1 was not terminated (st_notify_fence) by wcs2 for wcs2: Timer expired (ref=4a3b045d-cc08-4e2f-8279-a85d113781b2) by client crmd.2363
> Oct 29 13:05:33 wcs2 crmd[2363]:   notice: run_graph: Transition 0 (Complete=20, Pending=0, Fired=0, Skipped=29, Incomplete=0, Source=/usr/var/lib/pacemaker/pengine/pe-warn-54.bz2): Stopped
> Oct 29 13:05:33 wcs2 pengine[2362]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 29 13:05:33 wcs2 pengine[2362]:  warning: stage6: Scheduling Node wcs1 for STONITH
> 
> And this repeats in a loop forever...
> 
> The node wcs1 is off. Shouldn't SBD detect that, and shouldn't the cluster then start the resources?

The cluster can't start resources until fencing completes.
For some reason SBD is reporting that it is unable to fence wcs1 (the reboot operation ends with "Timer expired"), so the cluster cannot continue and simply schedules the fencing again.
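
To make the loop visible in a longer log, a rough sketch like the one below may help. It is not part of Pacemaker; the function name summarize_fencing and the two regular expressions are my own assumptions, written only against the message formats in the excerpt quoted above. It counts how often each node was scheduled for STONITH and how often the operation was reported as failed.

```python
import re

# Sample lines copied from the log excerpt quoted above.
LOG = """\
Oct 29 13:04:21 wcs2 pengine[2362]:  warning: stage6: Scheduling Node wcs1 for STONITH
Oct 29 13:04:21 wcs2 crmd[2363]:   notice: te_fence_node: Executing reboot fencing operation (53) on wcs1 (timeout=60000)
Oct 29 13:05:33 wcs2 crmd[2363]:   notice: tengine_stonith_callback: Stonith operation 2 for wcs1 failed (Timer expired): aborting transition.
Oct 29 13:05:33 wcs2 pengine[2362]:  warning: stage6: Scheduling Node wcs1 for STONITH
"""

# Assumed patterns, derived only from the messages shown in this thread.
SCHEDULED = re.compile(r"Scheduling Node (\S+) for STONITH")
FAILED = re.compile(r"Stonith operation \d+ for (\S+) failed \(([^)]*)\)")


def summarize_fencing(log_text):
    """Return per-node counts of scheduled and failed fencing attempts."""
    summary = {}

    def entry_for(node):
        # Create the per-node record on first sight of the node.
        return summary.setdefault(
            node, {"scheduled": 0, "failed": 0, "reasons": set()}
        )

    for line in log_text.splitlines():
        match = SCHEDULED.search(line)
        if match:
            entry_for(match.group(1))["scheduled"] += 1
            continue
        match = FAILED.search(line)
        if match:
            entry = entry_for(match.group(1))
            entry["failed"] += 1
            entry["reasons"].add(match.group(2))
    return summary


if __name__ == "__main__":
    for node, info in summarize_fencing(LOG).items():
        print(node, info["scheduled"], info["failed"], sorted(info["reasons"]))
```

If the scheduled count keeps growing while every attempt fails for the same reason, the cluster is stuck in the state described above, and the thing to investigate is why the fencing device cannot confirm the fence, not the resources themselves.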