[Pacemaker] DC marks remote operation as timed out, leading to stonith
Vladislav Bogdanov
bubble at hoster-ok.com
Tue Aug 9 19:27:35 UTC 2011
Hi,
I do not know if this is related, but I've just seen something "similar"
on a different two-node cluster. For some reason (out of scope here) the
corosync ring was broken, one cluster node was fenced, and resources
then began to start on the other node (s01-0). Then I saw the following:
=============
Aug 9 18:17:21 s01-0 crmd: [1728]: info: abort_transition_graph:
te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair,
id=status-s01-0-last-failure-config-templates-lv, magic=NA,
cib=0.6816.474) : Transient attribute: update
Aug 9 18:17:21 s01-0 crmd: [1728]: info: update_abort_priority: Abort
priority upgraded from 1 to 1000000
Aug 9 18:17:21 s01-0 crmd: [1728]: info: update_abort_priority: 'Event
failed' abort superceeded
Aug 9 18:17:36 s01-0 pengine: [1727]: WARN: unpack_rsc_op: Processing
failed op config-templates-lv_start_0 on s01-0: not running (7)
Aug 9 18:17:36 s01-0 pengine: [1727]: WARN: unpack_rsc_op: Processing
failed op config-templates-lv_stop_0 on s01-0: unknown error (1)
Aug 9 18:17:36 s01-0 pengine: [1727]: WARN: pe_fence_node: Node s01-0
will be fenced to recover from resource failure(s)
=============
But config-templates-lv_stop_0 was never run.
At the same time (18:17:22) pacemaker began to stop all the other logical
volume resources (except config-templates-lv) and succeeded.
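If it helps anyone hitting the same thing: clearing that spurious failure
record should take the failed stop out of the picture. Just a rough sketch,
assuming the crm shell is available (names taken from the logs above):
=============
# clear the recorded failure and fail-count for config-templates-lv on s01-0,
# so the policy engine no longer sees a failed stop there
crm resource cleanup config-templates-lv s01-0
=============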
09.08.2011 11:40, Vladislav Bogdanov wrote:
> Hi Andrew, all,
>
> Just found a slightly strange crmd behaviour - the DC timed out a stop
> operation (which led to node fencing) by itself, without even waiting
> for lrmd (on the other node) to finish that operation.
>
> Here is what it did:
> ===============
> Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: action_timer_callback: Timer
> popped (timeout=600000, abort_level=1000000, complete=false)
> Aug 9 07:30:07 mgmt01 crmd: [15781]: ERROR: print_elem: Aborting
> transition, action lost: [Action 228]: In-flight (id:
> vptest22.vds-ok.com-vm_stop_0, loc: v03-a, priority: 0)
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: abort_transition_graph:
> action_timer_callback:486 - Triggered transition abort (complete=0) :
> Action lost
> Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: cib_action_update: rsc_op
> 228: vptest22.vds-ok.com-vm_stop_0 on v03-a timed out
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: create_operation_update:
> cib_action_update: Updating resouce vptest22.vds-ok.com-vm after Timed
> Out stop op (interval=0)
> Aug 9 07:30:07 mgmt01 crmd: [15781]: notice: run_graph: Transition 1500
> (Complete=64, Pending=0, Fired=0, Skipped=40, Incomplete=0,
> Source=/var/lib/pengine/pe-input-19.bz2): Stopped
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: te_graph_trigger: Transition
> 1500 is now complete
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: do_state_transition: All 3
> cluster nodes are eligible to run resources.
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: process_graph_event: Action
> vptest22.vds-ok.com-vm_stop_0 arrived after a completed transition
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: abort_transition_graph:
> process_graph_event:482 - Triggered transition abort (complete=1,
> tag=lrm_rsc_op, id=vptest22.vds-ok.com-vm_last_failure_0,
> magic=2:1;228:1500:0:c9c55d75-3a1e-4fcc-af7f-81ede42fa918,
> cib=0.4519.88) : Inactive graph
> Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: update_failcount: Updating
> failcount for vptest22.vds-ok.com-vm on v03-a after failed stop: rc=1
> (update=INFINITY, time=1312875007)
> Aug 9 07:30:08 mgmt01 crmd: [15781]: info: abort_transition_graph:
> te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair,
> id=status-v03-a-fail-count-vptest22.vds-ok.com-vm,
> name=fail-count-vptest22.vds-ok.com-vm, value=INFINITY, magic=NA,
> cib=0.4519.91) : Transient attribute: update
> Aug 9 07:30:10 mgmt01 pengine: [15780]: WARN: pe_fence_node: Node v03-a
> will be fenced to recover from resource failure(s)
> Aug 9 07:30:10 mgmt01 pengine: [15780]: WARN: unpack_rsc_op: Processing
> failed op vptest22.vds-ok.com-vm_last_failure_0 on v03-a: unknown error (1)
> ==============
>
> That was a mass stop of 60 virtual machines on two 12-core nodes. I
> suspect that the nodes were under very high load at that time, so it is
> possible that crmd and lrmd on the node being fenced had no chance to
> get enough CPU cycles from the scheduler. I saw messages from lrmd there
> that some operations stayed in the op list for a very long time
> (~10 min). But the last such message was logged one minute before the DC
> decided to time out that stop operation. During that minute only some
> informative messages from some resource agents were logged, nothing from
> lrmd or crmd.
>
> What could cause such behaviour? High CPU load?
>
> Is it intended that the DC may do that (mark an operation as timed out)?
>
> What can be done to prevent such things in the future? A higher
> scheduling priority (up to real-time) for lrmd (or for the whole
> pacemaker subsystem)? Bigger timeouts?
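> For reference, here is roughly what I have in mind - just a sketch, the
> crm shell syntax and the concrete values are untested assumptions:
> ==============
> # raise the default operation timeout cluster-wide (crm shell)
> crm configure op_defaults timeout="1200s"
>
> # and/or give lrmd a round-robin real-time priority on each node
> chrt -r -p 10 $(pidof lrmd)
> ==============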
>
> Should I post an hb_report for that (if it is not intended behaviour, of course)?
>
> Thank you,
> Vladislav
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker