[Pacemaker] [Problem] Update of the fail-count is too late to limit the number of failures.
renayama19661014 at ybb.ne.jp
Wed Nov 10 04:20:48 UTC 2010
Hi,
We built a cluster consisting of two nodes.
The migration-threshold was set to 2.
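For reference, the setting in question can be sketched as a crm configure fragment like the following. This is a hypothetical illustration only, not the actual config5.crm; the operation timeouts are assumptions, while the 15s monitor interval, stonith-enabled=false, and the clnDummy2/Dummy2 names come from the logs below:

```
primitive Dummy2 ocf:pacemaker:Dummy2 \
        meta migration-threshold="2" \
        op start interval="0" timeout="90s" \
        op monitor interval="15s" timeout="60s"
clone clnDummy2 Dummy2
property stonith-enabled="false"
```

With migration-threshold="2", the second failure on a node should force the resource off that node, which is why it matters exactly when pengine sees the fail-count reach INFINITY.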
We confirmed the phenomenon with the following procedure.
Step1) Start both nodes and load config5.crm. (clnDiskd is our own custom resource.)
============
Last updated: Tue Nov 9 21:10:49 2010
Stack: Heartbeat
Current DC: srv02 (8c93dc22-a27e-409b-8112-4073de622daf) - partition with quorum
Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438
2 Nodes configured, unknown expected votes
5 Resources configured.
============
Online: [ srv01 srv02 ]
vip (ocf::heartbeat:IPaddr2): Started srv01
Clone Set: clnDiskd
Started: [ srv01 srv02 ]
Clone Set: clnDummy2
Started: [ srv01 srv02 ]
Clone Set: clnPingd1
Started: [ srv01 srv02 ]
Node Attributes:
* Node srv01:
+ default_ping_set1 : 100
+ diskcheck_status_internal : normal
* Node srv02:
+ default_ping_set1 : 100
+ diskcheck_status_internal : normal
Migration summary:
* Node srv02:
* Node srv01:
Step2) Edit the clnDummy2 resource agent so that start times out. (add a sleep)
dummy_start() {
    sleep 180    # ----> added sleep to force the start timeout
    dummy_monitor
    if [ $? = $OCF_SUCCESS ]; then
Step3) Cause a monitor failure of the clnDummy2 resource.
# rm -rf /var/run/Dummy-Dummy2.state
Step4) The restart of clnDummy2 then times out on start.
However, the log shows that clnDummy2 is started yet again after that timeout.
The reason is that pengine does not know that the fail-count has become INFINITY:
in pe-input-2001.bz2 the fail-count is not yet INFINITY,
while in pe-input-2002.bz2 it is.
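The difference between the two PE inputs can be checked directly, since PEngine input files are bzip2-compressed CIB snapshots. A small helper (a sketch; fail-count-&lt;resource&gt; is the naming convention Pacemaker uses for these transient attributes):

```shell
# show_failcount <pe-input.bz2>
# Print any fail-count attribute stored in a PEngine input file,
# which is a bzip2-compressed CIB snapshot.
show_failcount() {
    bzcat "$1" | grep -o 'name="fail-count-[^"]*" value="[^"]*"'
}
```

Running it against /var/lib/pengine/pe-input-2001.bz2 should print nothing, while pe-input-2002.bz2 should show the INFINITY value; that difference is the delayed update described above.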
(snip)
Nov 9 21:12:35 srv02 crmd: [5896]: WARN: status_from_rc: Action 25 (Dummy2:0_start_0) on srv01 failed
(target: 0 vs. rc: -2): Error
Nov 9 21:12:35 srv02 crmd: [5896]: WARN: update_failcount: Updating failcount for Dummy2:0 on srv01
after failed start: rc=-2 (update=INFINITY, time=1289304755)
Nov 9 21:12:35 srv02 crmd: [5896]: info: abort_transition_graph: match_graph_event:272 - Triggered
transition abort (complete=0, tag=lrm_rsc_op, id=Dummy2:0_start_0,
magic=2:-2;25:5:0:275da7f9-7f43-43a2-8308-41d0ab78346e, cib=0.9.39) : Event failed
Nov 9 21:12:35 srv02 crmd: [5896]: info: match_graph_event: Action Dummy2:0_start_0 (25) confirmed on
srv01 (rc=4)
Nov 9 21:12:35 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 29 fired and confirmed
Nov 9 21:12:35 srv02 crmd: [5896]: info: run_graph:
====================================================
Nov 9 21:12:35 srv02 crmd: [5896]: notice: run_graph: Transition 5 (Complete=7, Pending=0, Fired=0,
Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-2000.bz2): Stopped
Nov 9 21:12:35 srv02 crmd: [5896]: info: te_graph_trigger: Transition 5 is now complete
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: State transition S_TRANSITION_ENGINE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: All 2 cluster nodes are eligible to run
resources.
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_pe_invoke: Query 72: Requesting the current CIB:
S_POLICY_ENGINE
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_pe_invoke_callback: Invoking the PE: query=72,
ref=pe_calc-dc-1289304755-58, seq=2, quorate=1
Nov 9 21:12:35 srv02 pengine: [7208]: notice: unpack_config: On loss of CCM Quorum: Ignore
Nov 9 21:12:35 srv02 pengine: [7208]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' =
0, 'green' = 0
Nov 9 21:12:35 srv02 pengine: [7208]: WARN: unpack_nodes: Blind faith: not fencing unseen nodes
Nov 9 21:12:35 srv02 pengine: [7208]: info: determine_online_status: Node srv02 is online
Nov 9 21:12:35 srv02 pengine: [7208]: info: determine_online_status: Node srv01 is online
Nov 9 21:12:35 srv02 pengine: [7208]: WARN: unpack_rsc_op: Processing failed op
Dummy2:0_monitor_15000 on srv01: not running (7)
Nov 9 21:12:35 srv02 pengine: [7208]: WARN: unpack_rsc_op: Processing failed op Dummy2:0_start_0 on
srv01: unknown exec error (-2)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: native_print: Dummy (ocf::pacemaker:Dummy): Started
srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: native_print: vip (ocf::heartbeat:IPaddr2): Started
srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: clone_print: Clone Set: clnDiskd
Nov 9 21:12:35 srv02 pengine: [7208]: notice: short_print: Started: [ srv01 srv02 ]
Nov 9 21:12:35 srv02 pengine: [7208]: notice: clone_print: Clone Set: clnDummy2
Nov 9 21:12:35 srv02 pengine: [7208]: notice: native_print: Dummy2:0 (ocf::pacemaker:Dummy2):
Started srv01 FAILED
Nov 9 21:12:35 srv02 pengine: [7208]: notice: short_print: Started: [ srv02 ]
Nov 9 21:12:35 srv02 pengine: [7208]: notice: clone_print: Clone Set: clnPingd1
Nov 9 21:12:35 srv02 pengine: [7208]: notice: short_print: Started: [ srv01 srv02 ]
Nov 9 21:12:35 srv02 pengine: [7208]: info: get_failcount: clnDummy2 has failed 1 times on srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: common_apply_stickiness: clnDummy2 can fail 1 more
times on srv01 before being forced off
Nov 9 21:12:35 srv02 pengine: [7208]: info: get_failcount: clnDummy2 has failed 1 times on srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: common_apply_stickiness: clnDummy2 can fail 1 more
times on srv01 before being forced off
Nov 9 21:12:35 srv02 pengine: [7208]: ERROR: unpack_operation: Specifying on_fail=fence and
stonith-enabled=false makes no sense
Nov 9 21:12:35 srv02 pengine: [7208]: notice: RecurringOp: Start recurring monitor (15s) for
Dummy2:0 on srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource Dummy (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource vip (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmDiskd:0 (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmDiskd:1 (Started srv02)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Recover resource Dummy2:0 (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource Dummy2:1 (Started srv02)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmPingd1:0 (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmPingd1:1 (Started srv02)
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Nov 9 21:12:35 srv02 crmd: [5896]: info: unpack_graph: Unpacked transition 6: 8 actions in 8 synapses
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_te_invoke: Processing graph 6
(ref=pe_calc-dc-1289304755-58) derived from /var/lib/pengine/pe-input-2001.bz2
Nov 9 21:12:35 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 30 fired and confirmed
Nov 9 21:12:35 srv02 crmd: [5896]: info: te_rsc_command: Initiating action 5: stop Dummy2:0_stop_0 on
srv01
Nov 9 21:12:35 srv02 pengine: [7208]: info: process_pe_message: Transition 6: PEngine Input stored
in: /var/lib/pengine/pe-input-2001.bz2
Nov 9 21:12:35 srv02 pengine: [7208]: info: process_pe_message: Configuration ERRORs found during PE
processing. Please run "crm_verify -L" to identify issues.
Nov 9 21:12:37 srv02 attrd: [5895]: info: attrd_ha_callback: flush message from srv01
Nov 9 21:12:37 srv02 crmd: [5896]: info: match_graph_event: Action Dummy2:0_stop_0 (5) confirmed on
srv01 (rc=0)
Nov 9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 31 fired and confirmed
Nov 9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 8 fired and confirmed
Nov 9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 28 fired and confirmed
Nov 9 21:12:37 srv02 crmd: [5896]: info: te_rsc_command: Initiating action 24: start Dummy2:0_start_0
on srv01
-----> This start must not be carried out.
Nov 9 21:12:37 srv02 crmd: [5896]: info: abort_transition_graph: te_update_diff:146 - Triggered
transition abort (complete=0, tag=transient_attributes, id=519bb7a2-3c31-414a-b6b2-eaef36a09fb7,
magic=NA, cib=0.9.41) : Transient attribute: update
Nov 9 21:12:37 srv02 crmd: [5896]: info: update_abort_priority: Abort priority upgraded from 0 to
1000000
Nov 9 21:12:37 srv02 crmd: [5896]: info: update_abort_priority: Abort action done superceeded by
restart
Nov 9 21:12:37 srv02 crmd: [5896]: info: abort_transition_graph: te_update_diff:146 - Triggered
transition abort (complete=0, tag=transient_attributes, id=519bb7a2-3c31-414a-b6b2-eaef36a09fb7,
magic=NA, cib=0.9.42) : Transient attribute: update
(snip)
The problem seems to be that the update of the fail-count was late.
This problem also seems to depend on timing.
When the fail-count does not limit the number of failures as configured,
the failover time of the resource is affected.
Has this problem already been discussed?
Isn't the delay of the fail-count update, which goes by way of attrd, the problem?
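One way to watch this timing from outside is to poll the fail-count while reproducing Step4 (a sketch; assumes the crm_failcount tool shipped with Pacemaker 1.0 and its -G/-U/-r options):

```
# Poll the fail-count that attrd has pushed into the CIB for Dummy2 on srv01.
# While this still prints the old value, pengine will also still compute
# transitions from the old value.
crm_failcount -G -U srv01 -r Dummy2:0
```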
* I attached the log and some pe-files to Bugzilla:
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2520
Best Regards,
Hideo Yamauchi.