[Pacemaker] Pacemaker fencing and DLM/cLVM

Daniel Dehennin daniel.dehennin at baby-gnu.org
Mon Nov 24 15:14:26 CET 2014


Hello,

In my Pacemaker/Corosync cluster it looks like I have an issue with the
fencing acknowledgement (ACK) on DLM/cLVM.

When a node is fenced, DLM/cLVM are not aware of the fencing result, and
LVM commands hang unless I run “dlm_tool fence_ack <ID_OF_THE_NODE>” by hand.

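For reference, this is roughly my manual workaround (dlm_tool comes with
dlm_controld; the node ID below is nebula1's corosync ID taken from the
membership logs, adjust it to whichever node was fenced):

  dlm_tool ls                      # lockspaces (clvmd, datastores) should show they wait for fencing
  dlm_tool fence_ack 1084811078    # acknowledge the fence of nebula1; cLVM/GFS2 unblock right after

Obviously I would like the acknowledgement to happen by itself instead of
running this by hand every time.
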
Here are some logs from around the fencing of nebula1:

Nov 24 09:51:06 nebula3 crmd[6043]:  warning: update_failcount: Updating failcount for clvm on nebula1 after failed stop: rc=1 (update=INFINITY, time=1416819066)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: stage6: Scheduling Node nebula1 for STONITH
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Move    Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    dlm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    clvm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: process_pe_message: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-warn-1.bz2
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: stage6: Scheduling Node nebula1 for STONITH
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Move    Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    dlm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    clvm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Nov 24 09:51:06 nebula3 crmd[6043]:   notice: te_fence_node: Executing reboot fencing operation (79) on nebula1 (timeout=30000)
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: handle_request: Client crmd.6043.5ec58277 wants to fence (reboot) 'nebula1' with device '(any)'
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula1: 50c93bed-e66f-48a5-bd2f-100a9e7ca7a1 (0)
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-nebula1-IPMILAN can fence nebula1: static-list
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can not fence nebula1: static-list
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-ONE-Frontend can not fence nebula1: static-list
Nov 24 09:51:09 nebula3 corosync[5987]:   [TOTEM ] A processor failed, forming new configuration.
Nov 24 09:51:13 nebula3 corosync[5987]:   [TOTEM ] A new membership (192.168.231.71:81200) was formed. Members left: 1084811078
Nov 24 09:51:13 nebula3 lvm[6311]: confchg callback. 0 joined, 1 left, 2 members
Nov 24 09:51:13 nebula3 corosync[5987]:   [QUORUM] Members[2]: 1084811079 1084811080
Nov 24 09:51:13 nebula3 corosync[5987]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 24 09:51:13 nebula3 pacemakerd[6036]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
Nov 24 09:51:13 nebula3 crmd[6043]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
Nov 24 09:51:13 nebula3 kernel: [  510.140107] dlm: closing connection to node 1084811078
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811079 walltime 1416819073 local 509
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 pid 7142 nodedown time 1416819073 fence_all dlm_stonith
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence result 1084811078 pid 7142 result 1 exit status
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811080 walltime 1416819073 local 509
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 no actor
Nov 24 09:51:13 nebula3 stonith-ng[6039]:   notice: remote_op_done: Operation reboot of nebula1 by nebula2 for crmd.6043 at nebula3.50c93bed: OK
Nov 24 09:51:13 nebula3 crmd[6043]:   notice: tengine_stonith_callback: Stonith operation 4/79:5:0:817919e5-fa6d-4381-b0bd-42141ce0bb41: OK (0)
Nov 24 09:51:13 nebula3 crmd[6043]:   notice: tengine_stonith_notify: Peer nebula1 was terminated (reboot) by nebula2 for nebula3: OK (ref=50c93bed-e66f-48a5-bd2f-100a9e7ca7a1) by client crmd.6043
Nov 24 09:51:13 nebula3 crmd[6043]:   notice: te_rsc_command: Initiating action 22: start Stonith-nebula3-IPMILAN_start_0 on nebula2
Nov 24 09:51:14 nebula3 crmd[6043]:   notice: run_graph: Transition 5 (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Nov 24 09:51:14 nebula3 pengine[6042]:   notice: process_pe_message: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-2.bz2
Nov 24 09:51:14 nebula3 crmd[6043]:   notice: te_rsc_command: Initiating action 21: monitor Stonith-nebula3-IPMILAN_monitor_1800000 on nebula2
Nov 24 09:51:15 nebula3 crmd[6043]:   notice: run_graph: Transition 6 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2.bz2): Complete
Nov 24 09:51:15 nebula3 crmd[6043]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 datastores wait for fencing
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 clvmd wait for fencing
Nov 24 09:55:10 nebula3 dlm_controld[6263]: 747 fence status 1084811078 receive -125 from 1084811079 walltime 1416819310 local 747

When the node is fenced I get “clvmd wait for fencing” and “datastores
wait for fencing” (“datastores” is my GFS2 volume), and they stay that way
until I acknowledge the fence manually.

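So far the only checks I know of are on the DLM and stonith sides; a rough
sketch (assuming dlm_tool from dlm_controld and stonith_admin from
Pacemaker, exact output quoted from memory so take it as a sketch):

  dlm_tool status                    # dlm_controld's per-node view, including its fencing state
  dlm_tool dump | grep fence         # daemon debug buffer, same fence request/result lines as in syslog
  stonith_admin --history nebula1    # what stonith-ng recorded; here the reboot is reported OK

Pacemaker reports the reboot as OK while dlm_controld keeps waiting, which
is what puzzles me.
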
Any idea of something else I could check when this happens?

Regards.
-- 
Daniel Dehennin
Fetch my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF