[ClusterLabs] ClusterIP location constraint reappears after reboot
Jeremy Matthews
Jeremy.Matthews at genband.com
Mon Feb 22 23:23:44 UTC 2016
Thanks for the quick response again, and pardon the delay in responding. A colleague of mine and I have been trying some different things today.
From Friday's reboot, though, the corosync.log entries from the time of the reboot command until the constraint was added are further below.
I am not able to run "pcs cluster cib-upgrade". The version of pcs that I have does not have that option (just "cib [filename]" and "cib-push <filename>"). My versions at the time of these logs were:
[root@g5se-f3efce Packages]# pcs --version
0.9.90
[root@g5se-f3efce Packages]# pacemakerd --version
Pacemaker 1.1.11
Written by Andrew Beekhof
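Side note: the schema upgrade that "pcs cluster cib-upgrade" performs can also be done with cibadmin, which ships with pacemaker itself. I have not tried this here, so treat it as an untested workaround:

cibadmin --upgrade --force    # upgrade the CIB validate-with schema in place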
I think you're right in that we had a script banning the ClusterIP. It is called from a message daemon that we created, which acts as middleware between the cluster software and our application. The daemon has an exit handler that calls a script which runs:
pcs resource ban ClusterIP $host    # where $host is set with host=`hostname`
...because we normally try to push the cluster IP to the other side (though in this case we just have one node), but then right after that the script calls:
pcs resource clear ClusterIP
...but for some reason that doesn't seem to result in the constraint being removed (see even further below, where I show a /var/log/messages snippet with both the constraint addition and removal; that was with an earlier pacemaker, 1.1.10-1.el6_4.4). I guess with the earlier pcs or pacemaker version these logs went to /var/log/messages, whereas today they go to corosync.log.
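In other words, the handler effectively does something like this (the final verification line is only an illustration added here, not something our script actually runs):

host=`hostname`
pcs resource ban ClusterIP $host     # push ClusterIP away from this node
pcs resource clear ClusterIP         # should remove the cli-ban-ClusterIP-on-$host constraint again
pcs constraint ref ClusterIP         # verify: ideally no cli-ban-* id is listed any more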
I am in a bit of a conundrum: if I upgrade pcs to 0.9.149 (retrieved and "make install"-ed from github.com, because 0.9.139 had a pcs issue with one-node clusters), which does have the cib-upgrade option, and then manually remove the ClusterIP constraint, our message daemon thinks neither side of the cluster is active; that is something to look at on our end. So it seems the removal of the constraint affects our daemon under the new pcs. For the time being, I've rolled pcs back to the 0.9.90 version above.
One other thing to mention is that pacemaker's start may have been delayed by what I found was a change to its init-script header (made by either our daemon or our application installation script) from "90 1" to "70 20". So in /etc/rc3.d there is S70pacemaker rather than S90pacemaker. I am not a Linux expert by any means; I guess that may affect startup, but I'm not sure about shutdown.
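If that header change turns out to be the culprit, something like the following should restore the original ordering (a sketch only; verify the stock header, "90 1" per the above, before changing anything):

grep chkconfig /etc/init.d/pacemaker    # confirm what the header currently says
chkconfig --del pacemaker               # drop the existing rc*.d symlinks
chkconfig --add pacemaker               # re-create the S/K links from the (corrected) header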
Corosync logs from the time reboot was issued to the constraint being added:
Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: standby (true)
Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: attrd_perform_update: Sent update 24: standby=true
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/attrd/24)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: --- 0.291.2 2
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: +++ 0.291.3 (null)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib: @num_updates=3
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: ++ /cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']: <nvpair id="status-g5se-f3efce-standby" name="standby" value="true"/>
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: abort_transition_graph: Transition aborted by status-g5se-f3efce-standby, standby=true: Transient attribute change (create cib=0.291.3, source=te_update_diff:391, path=/cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce'], 1)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=g5se-f3efce/attrd/24, version=0.291.3)
Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: update_validation: Transformation upgrade-1.3.xsl successful
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: cli_config_update: Your configuration was internally updated to the latest version (pacemaker-2.0)
Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: unpack_status: Node g5se-f3efce is in standby-mode
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: determine_online_status: Node g5se-f3efce is standby
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started g5se-f3efce
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: Resource sw-ready-g5se-f3efce cannot run anywhere
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: Resource meta-data cannot run anywhere
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: Resource netmon cannot run anywhere
Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: Resource ClusterIP cannot run anywhere
Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop sw-ready-g5se-f3efce (g5se-f3efce)
Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop meta-data (g5se-f3efce)
Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop netmon (g5se-f3efce)
Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop ClusterIP (g5se-f3efce)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_te_invoke: Processing graph 8 (ref=pe_calc-dc-1455920542-41) derived from /var/lib/pacemaker/pengine/pe-input-641.bz2
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: Initiating action 8: stop sw-ready-g5se-f3efce_stop_0 on g5se-f3efce (local)
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: cancel_recurring_action: Cancelling operation sw-ready-g5se-f3efce_monitor_10000
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: Performing key=8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed op=sw-ready-g5se-f3efce_stop_0
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: executing - rsc:sw-ready-g5se-f3efce action:stop call_id:31
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: Initiating action 9: stop meta-data_stop_0 on g5se-f3efce (local)
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: cancel_recurring_action: Cancelling operation meta-data_monitor_60000
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: Performing key=9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed op=meta-data_stop_0
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: executing - rsc:meta-data action:stop call_id:33
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: Initiating action 10: stop netmon_stop_0 on g5se-f3efce (local)
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: cancel_recurring_action: Cancelling operation netmon_monitor_10000
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: Performing key=10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed op=netmon_stop_0
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: executing - rsc:netmon action:stop call_id:35
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: Initiating action 11: stop ClusterIP_stop_0 on g5se-f3efce (local)
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: Performing key=11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed op=ClusterIP_stop_0
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: executing - rsc:ClusterIP action:stop call_id:37
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: Operation sw-ready-g5se-f3efce_monitor_10000: Cancelled (node=g5se-f3efce, call=29, confirmed=true)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: Operation meta-data_monitor_60000: Cancelled (node=g5se-f3efce, call=21, confirmed=true)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: Operation netmon_monitor_10000: Cancelled (node=g5se-f3efce, call=23, confirmed=true)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: Operation ClusterIP_monitor_30000: Cancelled (node=g5se-f3efce, call=25, confirmed=true)
Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: process_pe_message: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-641.bz2
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: finished - rsc:sw-ready-g5se-f3efce action:stop call_id:31 pid:6013 exit-code:0 exec-time:56ms queue-time:0ms
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: Operation sw-ready-g5se-f3efce_stop_0: ok (node=g5se-f3efce, call=31, rc=0, cib-update=72, confirmed=true)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/72)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: --- 0.291.3 2
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: +++ 0.291.4 (null)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib: @num_updates=4
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='sw-ready-g5se-f3efce']/lrm_rsc_op[@id='sw-ready-g5se-f3efce_last_0']: @operation_key=sw-ready-g5se-f3efce_stop_0, @operation=stop, @transition-key=8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, @transition-magic=0:0;8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, @call-id=31, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=56
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=g5se-f3efce/crmd/72, version=0.291.4)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: match_graph_event: Action sw-ready-g5se-f3efce_stop_0 (8) confirmed on g5se-f3efce (rc=0)
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: finished - rsc:meta-data action:stop call_id:33 pid:6014 exit-code:0 exec-time:72ms queue-time:0ms
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: Operation meta-data_stop_0: ok (node=g5se-f3efce, call=33, rc=0, cib-update=73, confirmed=true)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/73)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: --- 0.291.4 2
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: +++ 0.291.5 (null)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib: @num_updates=5
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='meta-data']/lrm_rsc_op[@id='meta-data_last_0']: @operation_key=meta-data_stop_0, @operation=stop, @crm-debug-origin=do_update_resource, @transition-key=9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, @transition-magic=0:0;9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, @call-id=33, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=g5se-f3efce/crmd/73, version=0.291.5)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: match_graph_event: Action meta-data_stop_0 (9) confirmed on g5se-f3efce (rc=0)
Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: ethmonitor-eth0 (<null>)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_delete operation for section status to master (origin=local/attrd/26)
Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: attrd_perform_update: Sent delete 26: node=g5se-f3efce, attr=ethmonitor-eth0, id=<n/a>, set=(null), section=status
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: --- 0.291.5 2
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: +++ 0.291.6 (null)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: -- /cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']/nvpair[@id='status-g5se-f3efce-ethmonitor-eth0']
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib: @num_updates=6
Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: finished - rsc:netmon action:stop call_id:35 pid:6015 exit-code:0 exec-time:99ms queue-time:0ms
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=g5se-f3efce/attrd/26, version=0.291.6)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: abort_transition_graph: Transition aborted by deletion of nvpair[@id='status-g5se-f3efce-ethmonitor-eth0']: Transient attribute change (cib=0.291.6, source=te_update_diff:391, path=/cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']/nvpair[@id='status-g5se-f3efce-ethmonitor-eth0'], 0)
Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: Operation netmon_stop_0: ok (node=g5se-f3efce, call=35, rc=0, cib-update=74, confirmed=true)
Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/74)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: --- 0.291.6 2
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: +++ 0.291.7 (null)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib: @num_updates=7
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='netmon']/lrm_rsc_op[@id='netmon_last_0']: @operation_key=netmon_stop_0, @operation=stop, @crm-debug-origin=do_update_resource, @transition-key=10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, @transition-magic=0:0;10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, @call-id=35, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=99
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=g5se-f3efce/crmd/74, version=0.291.7)
Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: match_graph_event: Action netmon_stop_0 (10) confirmed on g5se-f3efce (rc=0)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_delete operation for section constraints to master (origin=local/crm_resource/3)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_delete operation for section constraints: OK (rc=0, origin=g5se-f3efce/crm_resource/3, version=0.291.7)
IPaddr2[6016]: 2016/02/19_15:22:23 INFO: IP status = ok, IP_CIP=
Feb 19 15:22:23 [1996] g5se-f3efce lrmd: info: log_finished: finished - rsc:ClusterIP action:stop call_id:37 pid:6016 exit-code:0 exec-time:127ms queue-time:0ms
Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: process_lrm_event: Operation ClusterIP_stop_0: ok (node=g5se-f3efce, call=37, rc=0, cib-update=75, confirmed=true)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/75)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: --- 0.291.7 2
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: +++ 0.291.8 (null)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib: @num_updates=8
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='ClusterIP']/lrm_rsc_op[@id='ClusterIP_last_0']: @operation_key=ClusterIP_stop_0, @operation=stop, @transition-key=11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, @transition-magic=0:0;11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, @call-id=37, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=127
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=g5se-f3efce/crmd/75, version=0.291.8)
Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: match_graph_event: Action ClusterIP_stop_0 (11) confirmed on g5se-f3efce (rc=0)
Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: Transition 8 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-641.bz2): Stopped
Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: Transformation upgrade-1.3.xsl successful
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: Your configuration was internally updated to the latest version (pacemaker-2.0)
Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: Node g5se-f3efce is in standby-mode
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: determine_online_status: Node g5se-f3efce is standby
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Stopped
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: meta-data (ocf::pacemaker:GBmon): Stopped
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: netmon (ocf::heartbeat:ethmonitor): Stopped
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: Resource sw-ready-g5se-f3efce cannot run anywhere
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: Resource meta-data cannot run anywhere
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: Resource netmon cannot run anywhere
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: Resource ClusterIP cannot run anywhere
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave sw-ready-g5se-f3efce (Stopped)
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave meta-data (Stopped)
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave netmon (Stopped)
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave ClusterIP (Stopped)
Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_te_invoke: Processing graph 9 (ref=pe_calc-dc-1455920543-46) derived from /var/lib/pacemaker/pengine/pe-input-642.bz2
Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: Transition 9 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-642.bz2): Complete
Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: process_pe_message: Calculated Transition 9: /var/lib/pacemaker/pengine/pe-input-642.bz2
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_modify operation for section constraints to master (origin=local/crm_resource/3)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: --- 0.291.8 2
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: +++ 0.292.0 (null)
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib: @epoch=292, @num_updates=0
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" node="g5se-f3efce" score="-INFINITY"/>
Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_modify operation for section constraints: OK (rc=0, origin=g5se-f3efce/crm_resource/3, version=0.292.0)
Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: abort_transition_graph: Transition aborted by rsc_location.cli-ban-ClusterIP-on-g5se-f3efce 'create': Non-status change (cib=0.292.0, source=te_update_diff:383, path=/cib/configuration/constraints, 1)
Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: Transformation upgrade-1.3.xsl successful
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: Your configuration was internally updated to the latest version (pacemaker-2.0)
Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: Node g5se-f3efce is in standby-mode
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
/var/log/messages snippet showing, at the bottom, the addition and removal of the constraint (this is with pcs 0.9.90 and pacemaker 1.1.10-1.el6_4.4):
Feb 21 23:10:38 g5se-dea2b1 azMD[1584]: Sending INIT message to partner. Count 21
Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty1) main process (1732) killed by TERM signal
Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty2) main process (1734) killed by TERM signal
Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty3) main process (1736) killed by TERM signal
Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty4) main process (1738) killed by TERM signal
Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty5) main process (1740) killed by TERM signal
Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty6) main process (1742) killed by TERM signal
Feb 21 23:10:41 g5se-dea2b1 avahi-daemon[1473]: Got SIGTERM, quitting.
Feb 21 23:10:41 g5se-dea2b1 avahi-daemon[1473]: Leaving mDNS multicast group on interface eth0.IPv4 with address 172.20.240.124.
Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azIntTrmHandler] Int Trm handler 15
Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azExitHandler] exit handler
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.66.3
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.67.1 44c794d4381e36ea4f5d51d0dd7fde1d
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib admin_epoch="0" epoch="66" num_updates="3"/>
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <nvpair id="sw-ready-g5se-dea2b1-meta_attributes-target-role" name="target-role" value="Stopped"/>
Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop sw-ready-g5se-dea2b1#011(g5se-dea2b1)
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: Calculated Transition 32: /var/lib/pacemaker/pengine/pe-input-134.bz2
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating action 10: stop sw-ready-g5se-dea2b1_stop_0 on g5se-dea2b1 (local)
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM operation sw-ready-g5se-dea2b1_stop_0 (call=48, rc=0, cib-update=67, confirmed=true) ok
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 32 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-134.bz2): Complete
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.69.3
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.70.1 216351853e036a12a96b442b30522287
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib admin_epoch="0" epoch="69" num_updates="3"/>
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" role="Started" node="g5se-dea2b1" score="-INFINITY"/>
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop ClusterIP#011(g5se-dea2b1)
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: Calculated Transition 35: /var/lib/pacemaker/pengine/pe-input-137.bz2
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating action 7: stop ClusterIP_stop_0 on g5se-dea2b1 (local)
Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13237]: INFO: IP status = ok, IP_CIP=
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM operation ClusterIP_stop_0 (call=64, rc=0, cib-update=74, confirmed=true) ok
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 35 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-137.bz2): Complete
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.70.2
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.71.1 453ef48657244dc188b444348eb547ed
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" role="Started" node="g5se-dea2b1" score="-INFINITY"/>
Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <cib epoch="71" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Sun Feb 21 23:10:42 2016" update-origin="g5se-dea2b1" update-client="crm_resource" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="g5se-dea2b1"/>
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Start ClusterIP#011(g5se-dea2b1)
Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: Calculated Transition 36: /var/lib/pacemaker/pengine/pe-input-138.bz2
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating action 6: start ClusterIP_start_0 on g5se-dea2b1 (local)
Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azExitHandler] exit handler
Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: ip -f inet addr add 172.20.240.123/24 brd 172.20.240.255 dev eth0
Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: ip link set eth0 up
Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-172.20.240.123 eth0 172.20.240.123 auto not_used not_used
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM operation ClusterIP_start_0 (call=68, rc=0, cib-update=76, confirmed=true) ok
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating action 7: monitor ClusterIP_monitor_30000 on g5se-dea2b1 (local)
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM operation ClusterIP_monitor_30000 (call=71, rc=0, cib-update=77, confirmed=false) ok
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 36 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-138.bz2): Complete
Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 21 23:10:43 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.71.3
Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.72.1 4e5a3b6259a59f84bcfec6d0f16ad3ba
Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib admin_epoch="0" epoch="71" num_updates="3"/>
Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" role="Started" node="g5se-dea2b1" score="-INFINITY"/>
Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop ClusterIP#011(g5se-dea2b1)
Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: process_pe_message: Calculated Transition 37: /var/lib/pacemaker/pengine/pe-input-139.bz2
Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating action 7: stop ClusterIP_stop_0 on g5se-dea2b1 (local)
Feb 21 23:10:43 g5se-dea2b1 IPaddr2[13372]: INFO: IP status = ok, IP_CIP=
Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM operation ClusterIP_stop_0 (call=75, rc=0, cib-update=79, confirmed=true) ok
Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 37 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-139.bz2): Complete
Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 21 23:10:43 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.72.2
Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.73.1 93f902fd51a6750b828144d42f8c7a6e
Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" role="Started" node="g5se-dea2b1" score="-INFINITY"/>
Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <cib epoch="73" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Sun Feb 21 23:10:43 2016" update-origin="g5se-dea2b1" update-client="crm_resource" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="g5se-dea2b1"/>
Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: LogActions: Start ClusterIP#011(g5se-dea2b1)
Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: process_pe_message: Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-140.bz2
Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating action 6: start ClusterIP_start_0 on g5se-dea2b1 (local)
-----Original Message-----
From: users-request at clusterlabs.org [mailto:users-request at clusterlabs.org]
Sent: Monday, February 22, 2016 11:42 AM
To: users at clusterlabs.org
Subject: Users Digest, Vol 13, Issue 44
Today's Topics:
1. Re: fencing by node name or by node ID (Ken Gaillot)
2. Re: ClusterIP location constraint reappears after reboot
(Ken Gaillot)
----------------------------------------------------------------------
Message: 1
Date: Mon, 22 Feb 2016 11:10:57 -0600
From: Ken Gaillot <kgaillot at redhat.com>
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] fencing by node name or by node ID
On 02/21/2016 06:19 PM, Ferenc Wágner wrote:
> Hi,
>
> Last night a node in our cluster (Corosync 2.3.5, Pacemaker 1.1.14)
> experienced some failure and fell out of the cluster:
>
> Feb 21 22:11:12 vhbl06 corosync[3603]: [TOTEM ] A new membership (10.0.6.9:612) was formed. Members left: 167773709
> Feb 21 22:11:12 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the leave message. failed: 167773709
> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Removing all vhbl07 attributes for attrd_peer_change_cb
> Feb 21 22:11:12 vhbl06 cib[8304]: notice: Removing vhbl07/167773709 from the membership list
> Feb 21 22:11:12 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Lost attribute writer vhbl07
> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Removing vhbl07/167773709 from the membership list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: Our peer on the DC (vhbl07) is dead
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: Removing vhbl07/167773709 from the membership list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> Feb 21 22:11:12 vhbl06 corosync[3603]: [QUORUM] Members[4]: 167773705 167773706 167773707 167773708
> Feb 21 22:11:12 vhbl06 corosync[3603]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 kernel: [343490.563365] dlm: closing connection to node 167773709
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: Operation reboot of 167773709 by <no-one> for stonith-api.20937 at vhbl03.9c470723: No such device
> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: Peer 167773709 was not terminated (reboot) by <anyone> for vhbl03: No such device (ref=9c470723-d318-4c7e-a705-ce9ee5c7ffe5) by client stonith-api.20937
> Feb 21 22:11:12 vhbl06 dlm_controld[3641]: 343352 tell corosync to remove nodeid 167773705 from cluster
> Feb 21 22:11:15 vhbl06 corosync[3603]: [TOTEM ] A processor failed, forming new configuration.
> Feb 21 22:11:19 vhbl06 corosync[3603]: [TOTEM ] A new membership (10.0.6.10:616) was formed. Members left: 167773705
> Feb 21 22:11:19 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the leave message. failed: 167773705
>
> However, no fencing agent reported ability to fence the failing node
> (vhbl07), because stonith-ng wasn't looking it up by name, but by
> numeric ID (at least that's what the logs suggest to me), and the
> pcmk_host_list attributes contained strings like vhbl07.
>
> 1. Was it dlm_controld who requested the fencing?
>
> I suspect it because of the "dlm: closing connection to node
> 167773709" kernel message right before the stonith-ng logs. And
> dlm_controld really hasn't got anything to use but the corosync node
> ID.
Not based on this; dlm would print messages about fencing, with "dlm_controld.*fence request".
However it looks like these logs are not from the DC, which will say what process requested the fencing. It may be DLM or something else.
Also, DLM on any node might initiate fencing, so it's worth looking at all the nodes' logs around this time.
> 2. Shouldn't some component translate between node IDs and node names?
> Is this a configuration error in our setup? Should I include both in
> pcmk_host_list?
Yes, stonithd's create_remote_stonith_op() function will do the translation if the st_opt_cs_nodeid call option is set in the request XML. If that fails, you'll see a "Could not expand nodeid" warning in the log. That option is set by the kick() stonith API used by DLM, so it should happen automatically.
I'm not sure why it appears not to have worked here; logs from other nodes might help. Do corosync and pacemaker know the same node names?
That would be necessary to get the node name from corosync.
Have you tested fencing vhbl07 from the command line with stonith_admin to make sure fencing is configured correctly?
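For example, something along these lines (just a sketch using stonith_admin's standard options):

stonith_admin --list vhbl07       # which devices claim they can fence vhbl07
stonith_admin --reboot vhbl07     # request an actual test reboot of vhbl07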
> 3. After the failed fence, why was 167773705 (vhbl03) removed from the
> cluster? Because it was chosen to execute the fencing operation, but
> failed?
dlm_controld explicitly requested it. I'm not familiar enough with DLM to know why. It doesn't sound like a good idea to me.
> The logs continue like this:
>
> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Removing all vhbl03 attributes for attrd_peer_change_cb
> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Removing vhbl03/167773705 from the membership list
> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
> Feb 21 22:11:19 vhbl06 corosync[3603]: [QUORUM] Members[3]: 167773706 167773707 167773708
> Feb 21 22:11:19 vhbl06 corosync[3603]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
> Feb 21 22:11:19 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 cib[8304]: notice: Removing vhbl03/167773705 from the membership list
> Feb 21 22:11:19 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Removing vhbl03/167773705 from the membership list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 kernel: [343497.392381] dlm: closing connection to node 167773705
>
> 4. Why can't I see any action above to fence 167773705 (vhbl03)?
Only the DC and the node that executes the fence will have those logs.
The other nodes will just have the query results ("can/can not fence") and the final stonith result.
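For reference, the current DC can be checked from any node (a quick sketch; the DC at the time of the incident still has to be read from the logs):

crm_mon -1 | grep "Current DC"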
> Feb 21 22:11:19 vhbl06 crmd[8309]: warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Operation reboot of 167773709 by <no-one> for stonith-api.17462 at vhbl04.0cd1625d: No such device
> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: Peer 167773709 was not terminated (reboot) by <anyone> for vhbl04: No such device (ref=0cd1625d-a61e-4f94-930d-bb80a10b89da) by client stonith-api.17462
> Feb 21 22:11:19 vhbl06 dlm_controld[3641]: 343359 tell corosync to remove nodeid 167773706 from cluster
> Feb 21 22:11:22 vhbl06 corosync[3603]: [TOTEM ] A processor failed, forming new configuration.
> Feb 21 22:11:26 vhbl06 corosync[3603]: [TOTEM ] A new membership (10.0.6.11:620) was formed. Members left: 167773706
> Feb 21 22:11:26 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the leave message. failed: 167773706
>
> Looks like vhbl04 took over the job of fencing vhbl07 from vhbl03, and
> of course failed the exact same way. So it was expelled, too.
>
> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
> Feb 21 22:11:26 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Removing all vhbl04 attributes for attrd_peer_change_cb
> Feb 21 22:11:26 vhbl06 cib[8304]: notice: Removing vhbl04/167773706 from the membership list
> Feb 21 22:11:26 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Removing vhbl04/167773706 from the membership list
> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: Removing vhbl04/167773706 from the membership list
> Feb 21 22:11:26 vhbl06 crmd[8309]: warning: No match for shutdown action on 167773706
> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
> Feb 21 22:11:26 vhbl06 crmd[8309]: notice: Stonith/shutdown of vhbl04 not matched
> Feb 21 22:11:26 vhbl06 corosync[3603]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Feb 21 22:11:26 vhbl06 pacemakerd[8261]: notice: Membership 620: quorum lost (2)
> Feb 21 22:11:26 vhbl06 crmd[8309]: notice: Membership 620: quorum lost (2)
> Feb 21 22:11:26 vhbl06 corosync[3603]: [QUORUM] Members[2]: 167773707 167773708
>
> That, finally, was enough to lose quorum and paralize the cluster.
> Later, vhbl07 was rebooted by the hardware watchdog and came back for
> a cold welcome:
>
> Feb 21 22:24:53 vhbl06 corosync[3603]: [TOTEM ] A new membership (10.0.6.12:628) was formed. Members joined: 167773709
> Feb 21 22:24:53 vhbl06 corosync[3603]: [QUORUM] Members[2]: 167773708 167773709
> Feb 21 22:24:53 vhbl06 corosync[3603]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 21 22:24:53 vhbl06 crmd[8309]: notice: pcmk_quorum_notification: Node vhbl07[167773709] - state is now member (was lost)
> Feb 21 22:24:53 vhbl06 pacemakerd[8261]: notice: pcmk_quorum_notification: Node vhbl07[167773709] - state is now member (was lost)
> Feb 21 22:24:53 vhbl06 dlm_controld[3641]: 344173 daemon joined 167773709 needs fencing
> Feb 21 22:25:47 vhbl06 dlm_controld[3641]: 344226 clvmd wait for quorum
> Feb 21 22:29:26 vhbl06 crmd[8309]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: We do not have quorum - fencing and resource management disabled
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl04 is unclean because the node is no longer part of the cluster
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl04 is unclean
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl05 is unclean because the node is no longer part of the cluster
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl05 is unclean
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl07 is unclean because our peer process is no longer available
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl07 is unclean
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl03 is unclean because vm-niifdc is thought to be active there
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Action vm-dogwood_stop_0 on vhbl03 is unrunnable (offline)
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Action vm-niifidp_stop_0 on vhbl03 is unrunnable (offline)
> [...]
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl03 is unclean!
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl04 is unclean!
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl05 is unclean!
> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: We can fence vhbl07 without quorum because they're in our membership
> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Scheduling Node vhbl07 for STONITH
> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> [...]
> Feb 21 22:29:27 vhbl06 crmd[8309]: notice: Executing reboot fencing operation (212) on vhbl07 (timeout=60000)
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: Client crmd.8309.09cea2e7 wants to fence (reboot) 'vhbl07' with device '(any)'
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: Initiating remote operation reboot for vhbl07: 31b2023d-3fc5-419e-8490-91eb81254497 (0)
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 dlm_controld[3641]: 344447 daemon remove 167773709 already needs fencing
> Feb 21 22:29:27 vhbl06 dlm_controld[3641]: 344447 tell corosync to remove nodeid 167773709 from cluster
> Feb 21 22:29:30 vhbl06 corosync[3603]: [TOTEM ] A processor failed, forming new configuration.
> Feb 21 22:29:34 vhbl06 corosync[3603]: [TOTEM ] A new membership (10.0.6.12:632) was formed. Members left: 167773709
> Feb 21 22:29:34 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the leave message. failed: 167773709
> Feb 21 22:29:34 vhbl06 corosync[3603]: [QUORUM] Members[1]: 167773708
> Feb 21 22:29:34 vhbl06 corosync[3603]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 21 22:29:34 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:29:34 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:29:34 vhbl06 kernel: [344592.424938] dlm: closing connection to node 167773709
> Feb 21 22:29:42 vhbl06 stonith-ng[8305]: notice: Operation 'reboot' [5533] (call 2 from crmd.8309) for host 'vhbl07' with device 'fencing-vhbl07' returned: 0 (OK)
> Feb 21 22:29:42 vhbl06 stonith-ng[8305]: notice: Operation reboot of vhbl07 by vhbl06 for crmd.8309 at vhbl06.31b2023d: OK
> Feb 21 22:29:42 vhbl06 crmd[8309]: notice: Stonith operation 2/212:1:0:d06e9743-b452-4b6a-b3a9-d352a4454269: OK (0)
> Feb 21 22:29:42 vhbl06 crmd[8309]: notice: Peer vhbl07 was terminated (reboot) by vhbl06 for vhbl06: OK (ref=31b2023d-3fc5-419e-8490-91eb81254497) by client crmd.8309
>
> That is, fencing by node name worked all right.
>
> I wonder if I understood the issue right and what would be the best
> way to avoid it in the future. Please advise.
>
------------------------------
Message: 2
Date: Mon, 22 Feb 2016 11:39:03 -0600
From: Ken Gaillot <kgaillot at redhat.com>
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] ClusterIP location constraint reappears
after reboot
On 02/22/2016 07:26 AM, Jeremy Matthews wrote:
> Thank you, Ken Gaillot, for your response. Sorry for the delayed follow-up, but I have looked and looked at the scripts. There are a couple of scripts that have a pcs resource ban command, but they are not executed at the time of shutdown, which is when I've discovered that the constraint is put back in. Our application software did not change on the system. We just updated pcs and pacemaker (and dependencies). I had to roll back pcs because it has an issue.
>
> Below is from /var/log/cluster/corosync.log. Any clues here as to why the constraint might have been added? In my other system without the pacemaker update, there is not the addition of the constraint.
It might help to see the entire log from the time you issued the reboot command to when the constraint was added.
Notice in the cib logs it says "origin=local/crm_resource". That means that crm_resource was what originally added the constraint (pcs resource ban calls crm_resource).
I'd be curious whether this makes a difference: after removing the constraint, run "pcs cib-upgrade". It shouldn't, but it's the only thing I can think of to try.
CIB schema versions change when new features are added that require new CIB syntax. pcs should automatically run cib-upgrade if you ever use a newer feature than your current CIB version supports. You don't really need to cib-upgrade explicitly, but it doesn't hurt, and it will get rid of those "Transformed the configuration" messages.
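If it helps narrow it down, two quick checks (hypothetical commands; adjust the paths to wherever your scripts live): confirm which client last wrote the CIB, and search the init/shutdown scripts for whatever still calls the ban:

cibadmin --query | grep -E 'update-client|update-origin'            # which tool and host last modified the CIB
grep -rn "resource ban" /etc/init.d /etc/rc*.d 2>/dev/null          # hunt for scripts that ban resources at shutdown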
> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_te_invoke: Processing graph 9 (ref=pe_calc-dc-1455920543-46) derived from /var/lib/pacemaker/pengine/pe-input-642.bz2
> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: Transition 9 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-642.bz2): Complete
> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: process_pe_message: Calculated Transition 9: /var/lib/pacemaker/pengine/pe-input-642.bz2
> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Forwarding cib_modify operation for section constraints to master (origin=local/crm_resource/3)
> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: --- 0.291.8 2
> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: Diff: +++ 0.292.0 (null)
> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: + /cib: @epoch=292, @num_updates=0
> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" node="g5se-f3efce" score="-INFINITY"/>
> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: Completed cib_modify operation for section constraints: OK (rc=0, origin=g5se-f3efce/crm_resource/3, version=0.292.0)
> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: abort_transition_graph: Transition aborted by rsc_location.cli-ban-ClusterIP-on-g5se-f3efce 'create': Non-status change (cib=0.292.0, source=te_update_diff:383, path=/cib/configuration/constraints, 1)
> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: Transformation upgrade-1.3.xsl successful
> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: Your configuration was internally updated to the latest version (pacemaker-2.0)
> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: Node g5se-f3efce is in standby-mode
>
> I'm not sure how much needs to be included; my original email and Ken Gaillot's response to it are embedded below.
>
> Message: 3
> Date: Thu, 18 Feb 2016 13:37:31 -0600
> From: Ken Gaillot <kgaillot at redhat.com>
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] ClusterIP location constraint reappears
> after reboot
> Message-ID: <56C61D7B.9090005 at redhat.com>
> Content-Type: text/plain; charset=windows-1252
>
> On 02/18/2016 01:07 PM, Jeremy Matthews wrote:
>> Hi,
>>
>> We're having an issue with our cluster where after a reboot of our system a location constraint reappears for the ClusterIP. This causes a problem, because we have a daemon that checks the cluster state and waits until the ClusterIP is started before it kicks off our application. We didn't have this issue when using an earlier version of pacemaker. Here is the constraint as shown by pcs:
>>
>> [root at g5se-f3efce cib]# pcs constraint
>> Location Constraints:
>>   Resource: ClusterIP
>>     Disabled on: g5se-f3efce (role: Started)
>> Ordering Constraints:
>> Colocation Constraints:
>>
>> ...and here is our cluster status with the ClusterIP being Stopped:
>>
>> [root at g5se-f3efce cib]# pcs status
>> Cluster name: cl-g5se-f3efce
>> Last updated: Thu Feb 18 11:36:01 2016 Last change: Thu Feb 18
>> 10:48:33 2016 via crm_resource on g5se-f3efce
>> Stack: cman
>> Current DC: g5se-f3efce - partition with quorum
>> Version: 1.1.11-97629de
>> 1 Nodes configured
>> 4 Resources configured
>>
>>
>> Online: [ g5se-f3efce ]
>>
>> Full list of resources:
>>
>> sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce
>> meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce
>> netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce
>> ClusterIP (ocf::heartbeat:IPaddr2): Stopped
>>
>>
>> The cluster really just has one node at this time.
>>
>> I retrieve the constraint ID, remove the constraint, verify that ClusterIP is started, and then reboot:
>>
>> [root at g5se-f3efce cib]# pcs constraint ref ClusterIP
>> Resource: ClusterIP
>> cli-ban-ClusterIP-on-g5se-f3efce
>> [root at g5se-f3efce cib]# pcs constraint remove
>> cli-ban-ClusterIP-on-g5se-f3efce
>>
>> [root at g5se-f3efce cib]# pcs status
>> Cluster name: cl-g5se-f3efce
>> Last updated: Thu Feb 18 11:45:09 2016 Last change: Thu Feb 18
>> 11:44:53 2016 via crm_resource on g5se-f3efce
>> Stack: cman
>> Current DC: g5se-f3efce - partition with quorum
>> Version: 1.1.11-97629de
>> 1 Nodes configured
>> 4 Resources configured
>>
>>
>> Online: [ g5se-f3efce ]
>>
>> Full list of resources:
>>
>> sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce
>> meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce
>> netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce
>> ClusterIP (ocf::heartbeat:IPaddr2): Started g5se-f3efce
>>
>>
>> [root at g5se-f3efce cib]# reboot
>>
>> ....after reboot, log in, and the constraint is back and ClusterIP has not started.
>>
>>
>> I have noticed in /var/lib/pacemaker/cib that the cib-x.raw files get created when there are changes to the cib (cib.xml). After a reboot, I see the constraint being added in a diff between .raw files:
>>
>> [root at g5se-f3efce cib]# diff cib-7.raw cib-8.raw
>> 1c1
>> < <cib epoch="239" num_updates="0" admin_epoch="0"
>> validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:44:53
>> 2016" update-origin="g5se-f3efce" update-client="crm_resource"
>> crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce">
>> ---
>>> <cib epoch="240" num_updates="0" admin_epoch="0"
>>> validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:46:49
>>> 2016" update-origin="g5se-f3efce" update-client="crm_resource"
>>> crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce">
>> 50c50,52
>> < <constraints/>
>> ---
>>> <constraints>
>>> <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" node="g5se-f3efce" score="-INFINITY"/>
>>> </constraints>
>>
>>
>> I have also looked in /var/log/cluster/corosync.log and seen logs where it seems the cib is getting updated. I'm not sure if the constraint is being put back in at shutdown or at start up. I just don't understand why it's being put back in. I don't think our daemon code or other scripts are doing this, but it is something I could verify.
>
> I would look at any scripts running around that time first. Constraints that start with "cli-" were created by one of the CLI tools, so something must be calling it. The most likely candidates are pcs resource move/ban or crm_resource -M/--move/-B/--ban.
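(One quick way to hunt for whatever is calling them, assuming the usual locations, might be:

    # search init scripts and local scripts for anything that bans/moves resources
    grep -rn -e 'pcs resource ban' -e 'pcs resource move' -e 'crm_resource' \
        /etc/init.d /usr/local/bin 2>/dev/null

...the paths are only guesses; point it at wherever your daemon and its scripts actually live.)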
>
>> ********************************
>>
>> From "yum info pacemaker", my current version is:
>>
>> Name : pacemaker
>> Arch : x86_64
>> Version : 1.1.12
>> Release : 8.el6_7.2
>>
>> My earlier version was:
>>
>> Name : pacemaker
>> Arch : x86_64
>> Version : 1.1.10
>> Release : 1.el6_4.4
>>
>> I'm still using an earlier version of pcs, because the new one seems to have issues with Python:
>>
>> Name : pcs
>> Arch : noarch
>> Version : 0.9.90
>> Release : 1.0.1.el6.centos
>>
>> *******************************
>>
>> If anyone has ideas on the cause or thoughts on this, anything would be greatly appreciated.
>>
>> Thanks!
>>
>>
>>
>> Jeremy Matthews
>
> -----Original Message-----
> From: users-request at clusterlabs.org
> [mailto:users-request at clusterlabs.org]
> Sent: Friday, February 19, 2016 2:21 AM
> To: users at clusterlabs.org
> Subject: Users Digest, Vol 13, Issue 35
>
> Today's Topics:
>
> 1. Re: Too quick node reboot leads to failed corosync assert on
> other node(s) (Michal Koutný)
> 2. ClusterIP location constraint reappears after reboot
> (Jeremy Matthews)
> 3. Re: ClusterIP location constraint reappears after reboot
> (Ken Gaillot)
> 4. Re: Too quick node reboot leads to failed corosync assert on
> other node(s) (Jan Friesse)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 18 Feb 2016 17:32:48 +0100
> From: Michal Koutný <mkoutny at suse.com>
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Too quick node reboot leads to failed
> corosync assert on other node(s)
> Message-ID: <56C5F230.6020206 at suse.com>
> Content-Type: text/plain; charset="windows-1252"
>
> On 02/18/2016 10:40 AM, Christine Caulfield wrote:
>> I definitely remember looking into this, or something very like it,
>> ages ago. I can't find anything in the commit logs for either
>> corosync or cman that looks relevant though. If you're seeing it on
>> recent builds then it's obviously still a problem anyway and we ought to look into it!
> Thanks for your replies.
>
> So far this happened only once and we've done only "post mortem", alas no available reproducer. If I have time, I'll try to reproduce it locally and check whether it exists in the current version.
>
> Michal
>
> ------------------------------
>
> Message: 4
> Date: Fri, 19 Feb 2016 09:18:22 +0100
> From: Jan Friesse <jfriesse at redhat.com>
> To: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] Too quick node reboot leads to failed
> corosync assert on other node(s)
> Message-ID: <56C6CFCE.5060305 at redhat.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Michal Koutný wrote:
>> On 02/18/2016 10:40 AM, Christine Caulfield wrote:
>>> I definitely remember looking into this, or something very like it,
>>> ages ago. I can't find anything in the commit logs for either
>>> corosync or cman that looks relevant though. If you're seeing it on
>>> recent builds then it's obviously still a problem anyway and we ought to look into it!
>> Thanks for your replies.
>>
>> So far this happened only once and we've done only "post mortem",
>> alas no available reproducer. If I have time, I'll try to reproduce
>> it
>
> Ok. Actually I was trying to reproduce and was really not successful (current master). Steps I've used:
> - 2 nodes, token set to 30 sec
> - execute cpgbench on node2
> - pause node1 corosync (ctrl+z), kill node1 corosync (kill -9 %1)
> - wait until corosync on node2 moves into "entering GATHER state from..."
> - execute corosync on node1
>
> Basically, during recovery the new node trans list was never sent (and/or was ignored by node2).
>
> I'm going to try to test v1.4.7, but it's also possible that the bug is fixed
> by other commits (my favorites are
> cfbb021e130337603fe5b545d1e377296ecb92ea,
> 4ee84c51fa73c4ec7cbee922111a140a3aaf75df,
> f135b680967aaef1d466f40170c75ae3e470e147).
>
> Regards,
> Honza
>
>> locally and check whether it exists in the current version.
>>
>> Michal