[Pacemaker] pacemaker counts probe failure twice

Andrew Beekhof andrew at beekhof.net
Wed Oct 29 17:32:24 EDT 2014


> On 29 Oct 2014, at 10:01 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 
> I observe strange behavior that I cannot understand. Pacemaker 1.1.11-3ca8c3b.
> 
> There is a master/slave resource running. Maintenance-mode was set,
> pacemaker was restarted, and maintenance-mode was reset. This specific RA
> returns Slave instead of Master for the first probe. But what happens
> later is rather strange.
> 
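(As an aside, for anyone wanting to reproduce the cycle described above: a minimal sketch of it with crm_attribute follows. The cluster-stack restart step in the middle is platform-specific and only indicated as a comment here.)

    # put the whole cluster into maintenance mode
    crm_attribute --type crm_config --name maintenance-mode --update true

    # ... restart pacemaker on the node using the platform's service tooling ...

    # take the cluster out of maintenance mode again
    crm_attribute --type crm_config --name maintenance-mode --update false
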
> So on startup pacemaker initiates probes:
> 
> Oct 29 11:25:59 n1 crmd[2879]:   notice: te_rsc_command: Initiating
> action 3: monitor rsc_SAPHana_HDB_HDB00:0_monitor_0 on n1 (local)
> Oct 29 11:25:59 n1 crmd[2879]:   notice: te_rsc_command: Initiating
> action 4: monitor rsc_SAPHanaTopology_HDB_HDB00:0_monitor_0 on n1
> (local)
> 
> That's fine. The agents return OK (correctly for the clone
> rsc_SAPHanaTopology_HDB, possibly incorrectly for the master/slave
> rsc_SAPHana_HDB):
> 
> Oct 29 11:26:03 n1 crmd[2879]:   notice: process_lrm_event: LRM
> operation rsc_SAPHanaTopology_HDB_HDB00_monitor_0 (call=12, rc=0,
> cib-update=49, confirmed=true) ok
> Oct 29 11:26:03 n1 crmd[2879]:   notice: process_lrm_event: LRM
> operation rsc_SAPHana_HDB_HDB00_monitor_0 (call=7, rc=0,
> cib-update=50, confirmed=true) ok
> 
> But pacemaker suddenly counts each of them twice? Why?

From the looks of it, the crmd is getting the updates from the CIB twice.
It looks like this is a SUSE install, right?

Probably best to poke them directly about this; they will know exactly what went into their build, will be able to compare it to the current upstream, and can backport patches as needed.
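
(If it helps when raising it with them: on an RPM-based install, something like the following usually shows which build and patch set is actually running; the package name is assumed to be "pacemaker" and may differ.)

    rpm -q pacemaker
    rpm -q --changelog pacemaker | less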

> 
> Oct 29 11:26:03 n1 crmd[2879]:  warning: status_from_rc: Action 4
> (rsc_SAPHanaTopology_HDB_HDB00:0_monitor_0) on n1 failed (target: 7
> vs. rc: 0): Error
> Oct 29 11:26:03 n1 crmd[2879]:  warning: status_from_rc: Action 4
> (rsc_SAPHanaTopology_HDB_HDB00:0_monitor_0) on n1 failed (target: 7
> vs. rc: 0): Error
> 
> Oct 29 11:26:03 n1 crmd[2879]:  warning: status_from_rc: Action 3
> (rsc_SAPHana_HDB_HDB00:0_monitor_0) on n1 failed (target: 7 vs. rc:
> 0): Error
> Oct 29 11:26:03 n1 crmd[2879]:  warning: status_from_rc: Action 3
> (rsc_SAPHana_HDB_HDB00:0_monitor_0) on n1 failed (target: 7 vs. rc:
> 0): Error
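
(A side note on reading these warnings: for a probe, "target" is the rc the transition engine expected - 7, i.e. OCF_NOT_RUNNING, for a resource it believes is inactive - and "rc" is what the agent actually returned. For a master/slave agent the relevant monitor exit codes are sketched below; instance_is_running and instance_is_master are hypothetical helpers, not functions of the real SAPHana RA.)

    # OCF monitor exit codes as they appear in the log above:
    #   0 = OCF_SUCCESS        (running; for a master/slave resource: running as Slave)
    #   7 = OCF_NOT_RUNNING    (not running - what a probe expects for an inactive resource)
    #   8 = OCF_RUNNING_MASTER (running as Master)
    monitor() {
        if ! instance_is_running; then   # hypothetical helper
            return 7                     # OCF_NOT_RUNNING
        elif instance_is_master; then    # hypothetical helper
            return 8                     # OCF_RUNNING_MASTER
        fi
        return 0                         # OCF_SUCCESS -> reported as running (Slave)
    }
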
> 
> The maintenance mode gets reset and pacemaker starts the monitor:
> 
> Oct 29 11:32:06 n1 crmd[2879]:   notice: te_rsc_command: Initiating
> action 29: monitor rsc_SAPHana_HDB_HDB00_monitor_61000 on n1 (local)
> 
> which now properly indicates Master status
> 
> Oct 29 11:32:09 n1 crmd[2879]:   notice: process_lrm_event: LRM
> operation rsc_SAPHana_HDB_HDB00_monitor_61000 (call=55, rc=8,
> cib-update=110, confirmed=false) master
> 
> which is interpreted as an error because it was not the expected return code?
> 
> Oct 29 11:32:09 n1 crmd[2879]:  warning: status_from_rc: Action 29
> (rsc_SAPHana_HDB_HDB00_monitor_61000) on n1 failed (target: 0 vs. rc:
> 8): Error
> Oct 29 11:32:09 n1 crmd[2879]:  warning: update_failcount: Updating
> failcount for rsc_SAPHana_HDB_HDB00 on n1 after failed monitor: rc=8
> (update=value++, time=1414571529)
> Oct 29 11:32:09 n1 attrd[2877]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-rsc_SAPHana_HDB_HDB00
> (1)
> 
> Fine, so far it is logical. But where do the second error and the second
> failure-count increase suddenly come from? We started just one monitor so
> far ...
> 
> 
> Oct 29 11:32:09 n1 attrd[2877]:   notice: attrd_perform_update: Sent
> update 33: fail-count-rsc_SAPHana_HDB_HDB00=1
> Oct 29 11:32:09 n1 attrd[2877]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-rsc_SAPHana_HDB_HDB00
> (1414571529)
> Oct 29 11:32:09 n1 attrd[2877]:   notice: attrd_perform_update: Sent
> update 35: last-failure-rsc_SAPHana_HDB_HDB00=1414571529
> Oct 29 11:32:09 n1 attrd[2877]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-rsc_SAPHana_HDB_HDB00
> (2)
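
(For completeness: the fail counts being updated here are ordinary transient attributes and can be inspected and cleared with the standard tools, e.g. the sketch below; the resource name is the one from this log.)

    # one-shot cluster status including fail counts
    crm_mon -1 -f

    # clear the recorded failures for the resource once the cause is understood
    crm_resource --cleanup --resource rsc_SAPHana_HDB_HDB00
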
> 
> Now it starts recovery by stopping and then starting the master.
> 
> Oct 29 11:32:09 n1 pengine[2878]:   notice: LogActions: Recover
> rsc_SAPHana_HDB_HDB00:0 (Master n1)
> Oct 29 11:32:09 n1 crmd[2879]:   notice: te_rsc_command: Initiating
> action 38: demote rsc_SAPHana_HDB_HDB00_demote_0 on n1 (local)
> Oct 29 11:32:10 n1 crmd[2879]:   notice: process_lrm_event: LRM
> operation rsc_SAPHana_HDB_HDB00_demote_0 (call=60, rc=0,
> cib-update=116, confirmed=true) ok
> Oct 29 11:32:10 n1 crmd[2879]:   notice: te_rsc_command: Initiating
> action 6: stop rsc_SAPHana_HDB_HDB00_stop_0 on n1 (local)
> Oct 29 11:32:21 n1 crmd[2879]:   notice: process_lrm_event: LRM
> operation rsc_SAPHana_HDB_HDB00_stop_0 (call=61, rc=0, cib-update=118,
> confirmed=true) ok
> Oct 29 11:32:21 n1 pengine[2878]:   notice: LogActions: Start
> rsc_SAPHana_HDB_HDB00:1 (n1)
> Oct 29 11:32:21 n1 crmd[2879]:   notice: te_rsc_command: Initiating
> action 38: start rsc_SAPHana_HDB_HDB00:1_start_0 on n1 (local)
> Oct 29 11:33:12 n1 crmd[2879]:   notice: process_lrm_event: LRM
> operation rsc_SAPHana_HDB_HDB00_start_0 (call=62, rc=0,
> cib-update=121, confirmed=true) ok
> Oct 29 11:33:12 n1 pengine[2878]:   notice: LogActions: Promote
> rsc_SAPHana_HDB_HDB00:0 (Slave -> Master n1)
> Oct 29 11:33:12 n1 crmd[2879]:   notice: te_rsc_command: Initiating
> action 38: promote rsc_SAPHana_HDB_HDB00_promote_0 on n1 (local)
> Oct 29 11:33:14 n1 crmd[2879]:   notice: process_lrm_event: LRM
> operation rsc_SAPHana_HDB_HDB00_promote_0 (call=63, rc=0,
> cib-update=124, confirmed=true) ok
> 
> So we successfully restarted and promoted the local instance. And suddenly ...
> 
> Oct 29 11:33:14 n1 pengine[2878]:  warning: unpack_rsc_op_failure:
> Processing failed op monitor for rsc_SAPHana_HDB_HDB00:0 on n1: master
> (8)
> Oct 29 11:33:14 n1 pengine[2878]:   notice: LogActions: Demote
> rsc_SAPHana_HDB_HDB00:0 (Master -> Slave n1)
> Oct 29 11:33:14 n1 pengine[2878]:   notice: LogActions: Promote
> rsc_SAPHana_HDB_HDB00:1 (Slave -> Master n2)
> 
> 
> Why!?!
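
(To see what the policy engine would decide from the current CIB at a point like this, it can be replayed without changing anything, e.g.:)

    # re-run the policy engine against the live CIB and show the resulting
    # actions and allocation scores, without executing anything
    crm_simulate --live-check --show-scores
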
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




