[Pacemaker] Problems with SBD fencing

Jan Christian Kaldestad janck76 at gmail.com
Thu Aug 1 09:46:14 EDT 2013


Hi,

I am evaluating the SLES HA Extension 11 SP3 product. The cluster consists
of two nodes (active/passive), using an SBD stonith resource on a shared SAN
disk, configured according to http://www.linux-ha.org/wiki/SBD_Fencing

The SBD daemon is running on both nodes, and the stonith resource (defined
as a primitive) is running on one node only.
There is also a monitor operation for the stonith resource (interval=36000,
timeout=20).
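
For reference, the configuration is roughly as follows (a sketch; the
device path is a placeholder, and the SBD device itself is defined in
/etc/sysconfig/sbd as described on the wiki page):

 # /etc/sysconfig/sbd on both nodes (placeholder device path);
 # -W enables the hardware watchdog:
 SBD_DEVICE="/dev/<shared-SAN-partition>"
 SBD_OPTS="-W"

 # The stonith resource and cluster property, configured via the crm shell,
 # with the monitor operation mentioned above:
 crm configure primitive stonith_sbd stonith:external/sbd \
     op monitor interval="36000" timeout="20"
 crm configure property stonith-enabled="true"
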
I am having some problems getting failover/fencing to work as expected in
the following scenario:
- Node 1 is running the resources that I created (except stonith)
- Node 2 is running the stonith resource
- Disconnect Node 2 from the network by bringing the interface down
- Node 2's status changes to UNCLEAN (offline), but the stonith resource does
not fail over to Node 1, and Node 2 does not reboot as I would expect.
- Checking the logs on Node 1, I notice the following:
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: pe_fence_node: Node
slesha1n2i-u will be fenced because the node is no longer part of the
cluster
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning:
determine_online_status: Node slesha1n2i-u is unclean
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: custom_action:
Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: stage6: Scheduling
Node slesha1n2i-u for STONITH
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: LogActions: Move
stonith_sbd   (Started slesha1n2i-u -> slesha1n1i-u)
 ...
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: te_fence_node:
Executing reboot fencing operation (24) on slesha1n2i-u (timeout=60000)
 Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:   notice: handle_request:
Client crmd.8916.3144546f wants to fence (reboot) 'slesha1n2i-u' with
device '(any)'
 Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:   notice:
initiate_remote_stonith_op: Initiating remote operation reboot for
slesha1n2i-u: 8c00ff7b-2986-4b2a-8b4a-760e8346349b (0)
 Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:    error: remote_op_done:
Operation reboot of slesha1n2i-u by slesha1n1i-u for
crmd.8916@slesha1n1i-u.8c00ff7b: No route to host
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice:
tengine_stonith_callback: Stonith operation
3/24:3:0:8a0f32b2-f91c-4cdf-9cee-1ba9b6e187ab: No route to host (-113)
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice:
tengine_stonith_callback: Stonith operation 3 for slesha1n2i-u failed (No
route to host): aborting transition.
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: tengine_stonith_notify:
Peer slesha1n2i-u was not terminated (st_notify_fence) by slesha1n1i-u for
slesha1n1i-u: No route to host (ref=8c00ff7b-2986-4b2a-8b4a-760e8346349b)
by client crmd.8916
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: run_graph: Transition 3
(Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-warn-15.bz2): Stopped
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: unpack_config: On
loss of CCM Quorum: Ignore
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: pe_fence_node: Node
slesha1n2i-u will be fenced because the node is no longer part of the
cluster
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning:
determine_online_status: Node slesha1n2i-u is unclean
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: custom_action:
Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: stage6: Scheduling
Node slesha1n2i-u for STONITH
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: LogActions: Move
stonith_sbd   (Started slesha1n2i-u -> slesha1n1i-u)
 ...
 Aug  1 12:00:02 slesha1n1i-u crmd[8916]:   notice: too_many_st_failures:
Too many failures to fence slesha1n2i-u (11), giving up
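
If it helps with the diagnosis, these are the checks I can run from Node 1
while Node 2 is cut off (a sketch; the device path is a placeholder for our
SAN partition):

 # List the fencing devices registered with stonith-ng on Node 1:
 stonith_admin --list-registered

 # Verify that the SBD partition is readable from Node 1 and inspect
 # its header and message slots:
 sbd -d /dev/<shared-SAN-partition> dump
 sbd -d /dev/<shared-SAN-partition> list

 # Optionally write a harmless test message into Node 2's slot (Node 2
 # logs it if its sbd daemon can still read the disk):
 sbd -d /dev/<shared-SAN-partition> message slesha1n2i-u test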



- Then I bring Node 2 online again and start the cluster service, checking
the logs:
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ] CLM CONFIGURATION
CHANGE
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ] New Configuration:
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ]  r(0) ip(x.x.x.x)
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ] Members Left:
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ] Members Joined:
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] notice:
pcmk_peer_update: Transitional membership event on ring 376: memb=1, new=0,
lost=0
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
pcmk_peer_update: memb: slesha1n1i-u 168824371
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ] CLM CONFIGURATION
CHANGE
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ] New Configuration:
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ]  r(0) ip(x.x.x.x)
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ]  r(0) ip(y.y.y.y)
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ] Members Left:
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ] Members Joined:
 Aug  1 12:31:13 slesha1n1i-u cib[8911]:   notice: ais_dispatch_message:
Membership 376: quorum acquired
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CLM   ]  r(0) ip(y.y.y.y)
 Aug  1 12:31:13 slesha1n1i-u crmd[8916]:   notice: ais_dispatch_message:
Membership 376: quorum acquired
 Aug  1 12:31:13 slesha1n1i-u cib[8911]:   notice: crm_update_peer_state:
crm_update_ais_node: Node slesha1n2i-u[168824372] - state is now member
(was lost)
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] notice:
pcmk_peer_update: Stable membership event on ring 376: memb=2, new=1, lost=0
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
update_member: Node 168824372/slesha1n2i-u is now: member
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
pcmk_peer_update: NEW:  slesha1n2i-u 168824372
 Aug  1 12:31:13 slesha1n1i-u crmd[8916]:   notice: crm_update_peer_state:
crm_update_ais_node: Node slesha1n2i-u[168824372] - state is now member
(was lost)
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
pcmk_peer_update: MEMB: slesha1n1i-u 168824371
 Aug  1 12:31:13 slesha1n1i-u crmd[8916]:   notice: peer_update_callback:
Node return implies stonith of slesha1n2i-u (action 24) completed
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
pcmk_peer_update: MEMB: slesha1n2i-u 168824372
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
send_member_notification: Sending membership update 376 to 2 children
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [TOTEM ] A processor joined
or left the membership and a new membership was formed.
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
update_member: 0x69f2f0 Node 168824372 (slesha1n2i-u) born on: 376
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
send_member_notification: Sending membership update 376 to 2 children
 Aug  1 12:31:13 slesha1n1i-u crmd[8916]:   notice: crm_update_quorum:
Updating quorum status to true (call=119)
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [CPG   ] chosen downlist:
sender r(0) ip(x.x.x.x) ; members(old:1 left:0)
 Aug  1 12:31:13 slesha1n1i-u corosync[8905]:  [MAIN  ] Completed service
synchronization, ready to provide service.
 Aug  1 12:31:13 slesha1n1i-u crmd[8916]:   notice: too_many_st_failures:
Too many failures to fence slesha1n2i-u (13), giving up
 Aug  1 12:31:13 slesha1n1i-u crmd[8916]:   notice: too_many_st_failures:
Too many failures to fence slesha1n2i-u (13), giving up
 Aug  1 12:31:14 slesha1n1i-u mgmtd: [8917]: info: CIB query: cib
 Aug  1 12:31:14 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
update_member: Node slesha1n2i-u now has process list:
00000000000000000000000000151302 (1381122)
 Aug  1 12:31:14 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
send_member_notification: Sending membership update 376 to 2 children
 Aug  1 12:31:14 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
update_member: Node slesha1n2i-u now has process list:
00000000000000000000000000141302 (1315586)
 Aug  1 12:31:14 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
send_member_notification: Sending membership update 376 to 2 children
 Aug  1 12:31:14 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
update_member: Node slesha1n2i-u now has process list:
00000000000000000000000000101302 (1053442)
 Aug  1 12:31:14 slesha1n1i-u corosync[8905]:  [pcmk  ] info:
send_member_notification: Sending membership update 376 to 2 children
 Aug  1 12:31:15 slesha1n1i-u crmd[8916]:   notice: do_state_transition:
State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN
cause=C_HA_MESSAGE origin=route_message ]
 Aug  1 12:31:15 slesha1n1i-u crmd[8916]:   notice: too_many_st_failures:
Too many failures to fence slesha1n2i-u (13), giving up
 Aug  1 12:31:15 slesha1n1i-u crmd[8916]:   notice: too_many_st_failures:
Too many failures to fence slesha1n2i-u (13), giving up
- Cluster status changes to Online for both nodes, but the stonith resource
will not start on either node.
- Trying to start the resource manually, without success (roughly the
commands sketched after the log excerpt below).
- Trying to restart corosync on Node 1 (rcopenais restart), but it hangs
indefinitely. Checking the logs:
 Aug  1 12:42:08 slesha1n1i-u corosync[8905]:  [SERV  ] Unloading all
Corosync service engines.
 Aug  1 12:42:08 slesha1n1i-u corosync[8905]:  [pcmk  ] notice:
pcmk_shutdown: Shuting down Pacemaker
 Aug  1 12:42:08 slesha1n1i-u corosync[8905]:  [pcmk  ] notice: stop_child:
Sent -15 to mgmtd: [8917]
 Aug  1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: mgmtd is shutting down
 Aug  1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: final_crm: client_id=1
cib_name=live
 Aug  1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: final_crm: client_id=2
cib_name=live
 Aug  1 12:42:08 slesha1n1i-u mgmtd: [8917]: debug: [mgmtd] stopped
 Aug  1 12:42:08 slesha1n1i-u corosync[8905]:  [pcmk  ] notice:
pcmk_shutdown: mgmtd confirmed stopped
 Aug  1 12:42:08 slesha1n1i-u corosync[8905]:  [pcmk  ] notice: stop_child:
Sent -15 to crmd: [8916]
 Aug  1 12:42:08 slesha1n1i-u crmd[8916]:   notice: crm_shutdown:
Requesting shutdown, upper limit is 1200000ms
 Aug  1 12:42:08 slesha1n1i-u attrd[8914]:   notice: attrd_trigger_update:
Sending flush op to all hosts for: shutdown (1375353728)
 Aug  1 12:42:08 slesha1n1i-u attrd[8914]:   notice: attrd_perform_update:
Sent update 22: shutdown=1375353728
 Aug  1 12:42:08 slesha1n1i-u crmd[8916]:   notice: too_many_st_failures:
Too many failures to fence slesha1n2i-u (13), giving up
 Aug  1 12:42:08 slesha1n1i-u crmd[8916]:  warning: do_log: FSA: Input
I_TE_SUCCESS from abort_transition_graph() received in state S_POLICY_ENGINE
 Aug  1 12:42:38 slesha1n1i-u corosync[8905]:  [pcmk  ] notice:
pcmk_shutdown: Still waiting for crmd (pid=8916, seq=6) to terminate...
 Aug  1 12:43:08 slesha1n1i-u corosync[8905]:  [pcmk  ] notice:
pcmk_shutdown: Still waiting for crmd (pid=8916, seq=6) to terminate...
...
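
By "trying to start the resource manually" above I mean roughly the
following (a sketch; the cleanup step is only there in case the recorded
fencing failures are what is blocking the start):

 # On Node 1: clear recorded failures for the stonith resource, try to
 # start it explicitly, then check status and failure counts:
 crm resource cleanup stonith_sbd
 crm resource start stonith_sbd
 crm_mon -1 -f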

- Finally I kill the corosync process on Node 1 (killall -9 corosync), and
then corosync restarts.
- Checking status: all resources are up and running on Node 1, and the
stonith resource is running on Node 2 again.
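
For completeness, triggering a fence by hand in this setup would look
roughly like the following (a sketch; the device path is a placeholder, and
either command should reset Node 2):

 # Ask stonith-ng to reboot Node 2 through the configured fencing device:
 stonith_admin --reboot slesha1n2i-u

 # Or write a reset message directly into Node 2's slot on the SBD disk:
 sbd -d /dev/<shared-SAN-partition> message slesha1n2i-u reset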



I have tested the same scenario several times. Sometimes the fencing
mechanism works as expected, but other times the stonith resource is not
transferred to Node 1, as described here. So I need some assistance to
overcome this problem.



--
Best regards
Jan