[Pacemaker] pre_notify_demote is issued twice

Mon Feb 17 21:23:34 EST 2014

On 6 Feb 2014, at 7:45 pm, Keisuke MORI <keisuke.mori+ha at gmail.com> wrote:

> Hi,
> 
> I observed that pre_notify_demote is issued twice when a master
> resource is migrating.
> I'm wondering if this is the correct behavior.
> 
> Steps to reproduce:
> 
> - Start up a 2-node cluster configured for PostgreSQL streaming
> replication using the pgsql RA as a master/slave resource.
> - Kill the postgresql process on the master node to induce a fail-over.
> - The fail-over succeeds as expected, but pre_notify_demote is
> executed twice on each node before the master resource is demoted.
> 
> 100% reproducible on my cluster.
> 
> Pacemaker version: 1.1.11-rc4 (source build from the repo)
> OS: RHEL6.4
> 
> I have never seen this on a Pacemaker-1.0.* cluster with the same
> configuration.
> 
> The relevant logs and pe-inputs are attached.
> 
> 
> Diagnostics:
> 
> (1) The first transition, caused by the process failure (pe-input-160),
> initiates pre_notify_demote on both nodes and cancels the slave
> monitor on the slave node.
> {{{
> 171 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command:
> Initiating action 9: cancel prmPostgresql_cancel_10000 on rhel64-2
> 172 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command:
> Initiating action 79: notify prmPostgresql_pre_notify_demote_0 on
> rhel64-1 (local)
> 
> 175 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command:
> Initiating action 81: notify prmPostgresql_pre_notify_demote_0 on
> rhel64-2
> }}}
> 
> (2) When the cancellation of the slave monitor completes, the
> transition is aborted with the reason "Resource op removal".
> {{{
> 176 Jan 30 16:08:59 rhel64-1 crmd[8143]:     info: match_graph_event:
> Action prmPostgresql_monitor_10000 (9) confirmed on rhel64-2 (rc=0)
> 177 Jan 30 16:08:59 rhel64-1 cib[8138]:     info: cib_process_request:
> Completed cib_delete operation for section status: OK (rc=0,
> origin=rhel64-2/crmd/21, version=0.37.9)
> 178 Jan 30 16:08:59 rhel64-1 crmd[8143]:     info:
> abort_transition_graph: te_update_diff:258 - Triggered transition
> abort (complete=0, node=rhel64-2, tag=lrm_rsc_op,
> id=prmPostgresql_monitor_10000,
> magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) :
> Resource op removal
> }}}
> 
> (3) The abort triggers calculation of a second transition
> (pe-input-161), which results in pre_notify_demote being initiated
> again.

If the demote didn't complete (or wasn't even attempted), then unfortunately we must send the pre_notify_demote again.
The real bug may well be that the transition shouldn't have been aborted in the first place.
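
For anyone wanting to check their own logs for the same pattern, here is a minimal Python sketch. It only assumes the crmd "Initiating action" line format shown in the excerpts in this thread; the sample lines are those excerpts re-joined onto single lines. The function name count_initiations and the regular expression are illustrative and not part of Pacemaker.

```python
import re
from collections import Counter

# Matches crmd lines such as:
#   "... te_rsc_command: Initiating action 79: notify <op> on <node> (local)"
# The pattern is an assumption based only on the log lines quoted in this thread.
INITIATE_RE = re.compile(
    r"te_rsc_command:\s+Initiating action (?P<action>\d+): "
    r"(?P<task>\S+) (?P<op>\S+) on (?P<node>\S+)"
)


def count_initiations(lines, op_name):
    """Count how many times op_name was initiated on each node."""
    counts = Counter()
    for line in lines:
        match = INITIATE_RE.search(line)
        if match and match.group("op") == op_name:
            counts[match.group("node")] += 1
    return counts


# Log excerpts from this thread, with the mail line-wrapping undone.
SAMPLE = [
    "Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command: "
    "Initiating action 9: cancel prmPostgresql_cancel_10000 on rhel64-2",
    "Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command: "
    "Initiating action 79: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)",
    "Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command: "
    "Initiating action 81: notify prmPostgresql_pre_notify_demote_0 on rhel64-2",
    "Jan 30 16:09:01 rhel64-1 crmd[8143]:   notice: te_rsc_command: "
    "Initiating action 78: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)",
    "Jan 30 16:09:01 rhel64-1 crmd[8143]:   notice: te_rsc_command: "
    "Initiating action 80: notify prmPostgresql_pre_notify_demote_0 on rhel64-2",
]

counts = count_initiations(SAMPLE, "prmPostgresql_pre_notify_demote_0")
for node, times in sorted(counts.items()):
    print(node, times)
```

A count above one per node within a single fail-over would point at the re-sent notification described above. To run it against a real log, pass an open file object instead of SAMPLE.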

> {{{
> 227 Jan 30 16:09:01 rhel64-1 pengine[8142]:   notice:
> process_pe_message: Calculated Transition 15:
> /var/lib/pacemaker/pengine/pe-input-161.bz2
> 229 Jan 30 16:09:01 rhel64-1 crmd[8143]:   notice: te_rsc_command:
> Initiating action 78: notify prmPostgresql_pre_notify_demote_0 on
> rhel64-1 (local)
> 232 Jan 30 16:09:01 rhel64-1 crmd[8143]:   notice: te_rsc_command:
> Initiating action 80: notify prmPostgresql_pre_notify_demote_0 on
> rhel64-2
> }}}
> 
> I think that the transition abort at (2) should not happen.
> 
> Regards,
> -- 
> Keisuke MORI
> <logs-pre-notify-20140206.tar.bz2>
