[Pacemaker] help deciphering output

Andrew Beekhof andrew at beekhof.net
Thu Oct 9 23:04:59 EDT 2014


On 9 Oct 2014, at 5:06 pm, Alexandre <alxgomz at gmail.com> wrote:

> I have seen this behavior on several virtualised environments: when the VM backup starts, the VM actually freezes for a (short?) period of time. I guess it then stops responding to the other cluster nodes, thus triggering unexpected failover and/or fencing.

Alas the dlm is _really_ intolerant of any membership blips.
Once a node is marked failed the dlm wants it fenced.  Even if it comes back 1ms later.

> I have this kind of behavior on VMware environments using Veeam backup, as well as on Proxmox (I don't know what backup tool it uses).
> That's actually an interesting topic I never thought about raising here.
> How can we avoid that? By increasing timeouts? I am afraid we would have to reach unacceptably high timeout values, and I am not even sure that would fix the problem.
> I think not all VM snapshot strategies would trigger that problem. Do you guys have any feedback on which backup/snapshot method best suits corosync clusters?
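For reference, on a CMAN-based stack like the one in the logs below, the totem token timeout is usually raised in /etc/cluster/cluster.conf rather than in corosync.conf. A minimal sketch, assuming a cluster named "devrp" and a purely illustrative 30-second value (not a recommendation):

```
<!-- /etc/cluster/cluster.conf fragment (CMAN-based clusters only).
     token is in milliseconds; 30000 here is illustrative.
     A longer token makes corosync tolerate longer freezes before
     declaring a node dead, at the cost of slower detection of
     real failures. Cluster name and config_version are examples. -->
<cluster name="devrp" config_version="2">
  <totem token="30000"/>
  ...
</cluster>
```

After editing, bump config_version and propagate the change with `cman_tool version -r`. Whether any value is high enough to ride out a snapshot-induced freeze depends on how long the hypervisor stuns the VM, which is why raising the token alone may not fix the problem.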
> 
> Regards
> 
> Le 9 oct. 2014 01:24, "Alex Samad - Yieldbroker" <Alex.Samad at yieldbroker.com> a écrit :
> One of my nodes died in a 2 node cluster
> 
> I gather something went wrong, and it fenced/killed itself. But I am not sure what happened.
> 
> I think the VM backups may have run around that time, and a snapshot of the VM could have been taken
> 
> But there is nothing for me to put my finger on
> 
> Output from messages around that time
> 
> This is on devrp1
> Oct  8 23:31:38 devrp1 corosync[1670]:   [TOTEM ] A processor failed, forming new configuration.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [CMAN  ] quorum lost, blocking activity
> Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] Members[1]: 1
> Oct  8 23:31:40 devrp1 corosync[1670]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:2 left:1)
> Oct  8 23:31:40 devrp1 corosync[1670]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Oct  8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback: Membership 424: quorum lost
> Oct  8 23:31:42 devrp1 corosync[1670]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct  8 23:31:42 devrp1 corosync[1670]:   [CMAN  ] quorum regained, resuming activity
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] This node is within the primary component and will provide service.
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:42 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:1 left:0)
> Oct  8 23:31:42 devrp1 corosync[1670]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now lost (was member)
> Oct  8 23:31:42 devrp1 crmd[2350]:  warning: reap_dead_nodes: Our DC node (devrp2) left the cluster
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback: Membership 428: quorum acquired
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now member (was lost)
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
> Oct  8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we were killed by cman_tool or other application
> Oct  8 23:31:42 devrp1 pacemakerd[2339]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 stonith-ng[2346]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 crmd[2350]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 crmd[2350]:    error: crmd_cs_destroy: connection terminated
> Oct  8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
> Oct  8 23:31:43 devrp1 attrd[2348]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:43 devrp1 attrd[2348]:     crit: attrd_cs_destroy: Lost connection to Corosync service!
> Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Exiting...
> Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Disconnecting client 0x18cf240, pid=2350...
> Oct  8 23:31:43 devrp1 pacemakerd[2339]:    error: mcp_cpg_destroy: Connection destroyed
> Oct  8 23:31:43 devrp1 cib[2345]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:43 devrp1 cib[2345]:    error: cib_cs_destroy: Corosync connection lost!  Exiting.
> Oct  8 23:31:43 devrp1 stonith-ng[2346]:    error: stonith_peer_cs_destroy: Corosync connection terminated
> Oct  8 23:31:43 devrp1 dlm_controld[1752]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 dlm_controld[1752]: daemon cpg_dispatch error 2
> Oct  8 23:31:43 devrp1 gfs_controld[1801]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 crmd[2350]:   notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
> Oct  8 23:31:43 devrp1 attrd[2348]:    error: attrd_cib_connection_destroy: Connection to the CIB terminated...
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: qb_ipcs_event_sendv: new_event_notification (2347-2350-6): Bad file descriptor (9)
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: send_client_notify: Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: send_client_notify: Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
> 
> 
> Devrp2
> Oct  8 23:31:26 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:31:42 devrp2 crmd[2361]:   notice: cman_event_callback: Membership 428: quorum lost
> Oct  8 23:31:43 devrp2 fenced[1747]: telling cman to remove nodeid 1 from cluster
> Oct  8 23:31:45 devrp2 corosync[1685]:   [TOTEM ] A processor failed, forming new configuration.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CMAN  ] quorum lost, blocking activity
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[1]: 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CMAN  ] quorum regained, resuming activity
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] This node is within the primary component and will provide service.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CPG   ] chosen downlist: sender r(0) ip(10.172.214.52) ; members(old:2 left:1)
> Oct  8 23:31:45 devrp2 corosync[1685]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Oct  8 23:31:45 devrp2 fenced[1747]: receive_start 1:3 add node with started_count 1
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: cman_event_callback: Membership 428: quorum acquired
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: crm_update_peer_state: cman_event_callback: Node devrp1[1] - state is now lost (was member)
> Oct  8 23:31:45 devrp2 crmd[2361]:  warning: match_down_event: No match for shutdown action on devrp1
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: peer_update_callback: Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> Oct  8 23:31:49 devrp2 crmd[2361]:  warning: match_down_event: No match for shutdown action on devrp1
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: peer_update_callback: Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: do_election_count_vote: Election 2 (current: 2, owner: devrp1): Processed vote from devrp1 (Peer is not part of our cluster)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct  8 23:31:49 devrp2 kernel: dlm: closing connection to node 1
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1412658023)
> Oct  8 23:31:49 devrp2 dlm_controld[1773]: /sys/kernel/config/dlm/cluster/comms/1: rmdir failed: 2
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: LogActions: Start   ybrpip#011(devrp2)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: process_pe_message: Calculated Transition 1061: /var/lib/pacemaker/pengine/pe-input-2282.bz2
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: LogActions: Start   ybrpip#011(devrp2)
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: process_pe_message: Calculated Transition 1062: /var/lib/pacemaker/pengine/pe-input-2283.bz2
> Oct  8 23:31:50 devrp2 crmd[2361]:   notice: te_rsc_command: Initiating action 5: start ybrpip_start_0 on devrp2 (local)
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Adding inet address 10.172.214.50/24 with broadcast address 10.172.214.255 to device eth0
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Bringing device eth0 up
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.172.214.50 eth0 10.172.214.50 auto not_used not_used
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: process_lrm_event: LRM operation ybrpip_start_0 (call=995, rc=0, cib-update=1494, confirmed=true) ok
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: te_rsc_command: Initiating action 6: monitor ybrpip_monitor_5000 on devrp2 (local)
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=998, rc=0, cib-update=1495, confirmed=false) ok
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: run_graph: Transition 1062 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2283.bz2): Complete
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct  8 23:31:52 devrp2 ntpd[2099]: Listen normally on 6 eth0 10.172.214.50 UDP 123
> Oct  8 23:31:52 devrp2 ntpd[2099]: peers refreshed
> Oct  8 23:31:52 devrp2 corosync[1685]:   [TOTEM ] A processor failed, forming new configuration.
> Oct  8 23:31:54 devrp2 corosync[1685]:   [CMAN  ] quorum lost, blocking activity
> Oct  8 23:31:54 devrp2 corosync[1685]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Oct  8 23:31:54 devrp2 corosync[1685]:   [QUORUM] Members[1]: 2
> Oct  8 23:31:54 devrp2 corosync[1685]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: cman_event_callback: Membership 432: quorum lost
> Oct  8 23:31:54 devrp2 corosync[1685]:   [CPG   ] chosen downlist: sender r(0) ip(10.172.214.52) ; members(old:2 left:1)
> Oct  8 23:31:54 devrp2 corosync[1685]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Oct  8 23:31:54 devrp2 crmd[2361]:  warning: match_down_event: No match for shutdown action on devrp1
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: peer_update_callback: Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> Oct  8 23:31:54 devrp2 attrd[2359]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Oct  8 23:31:54 devrp2 attrd[2359]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1412658023)
> Oct  8 23:31:55 devrp2 attrd[2359]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Oct  8 23:31:55 devrp2 pengine[2360]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct  8 23:31:55 devrp2 pengine[2360]:   notice: process_pe_message: Calculated Transition 1063: /var/lib/pacemaker/pengine/pe-input-2284.bz2
> Oct  8 23:31:55 devrp2 crmd[2361]:   notice: run_graph: Transition 1063 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete
> Oct  8 23:31:55 devrp2 crmd[2361]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct  8 23:31:59 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:36:52 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Oct  8 23:46:55 devrp2 pengine[2360]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct  8 23:46:55 devrp2 pengine[2360]:   notice: process_pe_message: Calculated Transition 1064: /var/lib/pacemaker/pengine/pe-input-2284.bz2
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: run_graph: Transition 1064 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
