[Pacemaker] The strange behavior of Master/Slave when it failed to demote
renayama19661014 at ybb.ne.jp
Wed Jan 23 05:11:06 UTC 2013
Hi All,
I filed this problem in Bugzilla on behalf of Ms. Ikeda.
* http://bugs.clusterlabs.org/show_bug.cgi?id=5133
Best Regards,
Hideo Yamauchi.
--- On Thu, 2013/1/10, Junko IKEDA <tsukishima.ha at gmail.com> wrote:
>
>
> Hi,
>
> I'm running the Stateful RA with Pacemaker 1.0.12, and found that something is wrong with its demote behavior.
>
> This is my configuration:
> There are no stonith devices, and demote/stop are set to on-fail="block".
>
> # crm configure show
> node $id="21c624bd-c426-43dc-9665-bbfb92054bcd" dl380g5c
> node $id="3f6ec88d-ee47-4f63-bfeb-652b8dd96027" dl380g5d
> primitive dummy ocf:pacemaker:Stateful \
> op start interval="0s" timeout="100s" on-fail="restart" \
> op monitor interval="10s" role="Master" timeout="100s" on-fail="restart" \
> op monitor interval="20s" role="Slave" timeout="100s" on-fail="restart" \
> op promote interval="0s" timeout="100s" on-fail="restart" \
> op demote interval="0s" timeout="100s" on-fail="block" \
> op stop interval="0s" timeout="100s" on-fail="block"
> ms stateful dummy
> property $id="cib-bootstrap-options" \
> dc-version="1.0.12-066152e" \
> cluster-infrastructure="Heartbeat" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> startup-fencing="false" \
> crmd-transition-delay="2s"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="INFINITY" \
> migration-threshold="1"
>
>
>
> 1) Initial status (dl380g5c=Master/dl380g5d=Slave)
> # crm_mon -1 -n
>
> ============
> Last updated: Thu Jan 10 18:25:17 2013
> Stack: Heartbeat
> Current DC: dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027) - partition with quorum
> Version: 1.0.12-066152e
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Node dl380g5c (21c624bd-c426-43dc-9665-bbfb92054bcd): online
> dummy:0 (ocf::pacemaker:Stateful) Master
> Node dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027): online
> dummy:1 (ocf::pacemaker:Stateful) Started
>
>
>
> 2) Modify the Stateful RA to reproduce a demote failure ("demote NG"), and put the Master node into standby mode.
>
> # vim /usr/lib/ocf/resource.d/pacemaker/Stateful
> stateful_demote() {
> return $OCF_ERR_GENERIC
>
> stateful_check_state
> if [ $? = 0 ]; then
> # CRM Error - Should never happen
> return $OCF_NOT_RUNNING
>
> ...
>
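The hard-coded "return $OCF_ERR_GENERIC" above fails every demote until the agent is edited back. As a rough sketch only, the same failure could be switched on and off with a flag file. The names FAIL_DEMOTE_FLAG and demote_with_fault_injection and the /tmp path are hypothetical and are not part of the Stateful RA; the two return codes are the standard OCF values.

```shell
# Sketch only: fault injection for a demote action, switched by a flag file.
# FAIL_DEMOTE_FLAG and demote_with_fault_injection are hypothetical names.
OCF_SUCCESS=0       # standard OCF return code for success
OCF_ERR_GENERIC=1   # standard OCF return code for a generic error

FAIL_DEMOTE_FLAG="${FAIL_DEMOTE_FLAG:-/tmp/stateful_fail_demote}"

demote_with_fault_injection() {
    # While the flag file exists, report a failed demote (rc=1).
    if [ -e "$FAIL_DEMOTE_FLAG" ]; then
        return $OCF_ERR_GENERIC
    fi
    # Otherwise the real demote logic of the agent would run here.
    return $OCF_SUCCESS
}
```

Creating the flag file before "crm node standby" and removing it afterwards would presumably reproduce the same rc=1 demote failure without leaving the agent permanently broken.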
>
> # crm node standby dl380g5c
> # crm_mon -1 -n
> ============
> Last updated: Thu Jan 10 18:27:04 2013
> Stack: Heartbeat
> Current DC: dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027) - partition with quorum
> Version: 1.0.12-066152e
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Node dl380g5c (21c624bd-c426-43dc-9665-bbfb92054bcd): standby
> dummy:0 (ocf::pacemaker:Stateful) Slave (unmanaged) FAILED
> Node dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027): online
> dummy:1 (ocf::pacemaker:Stateful) Master
>
> Failed actions:
> dummy:0_demote_0 (node=dl380g5c, call=4, rc=1, status=complete): unknown error
>
>
> In the crm_mon output above, dl380g5c is shown as "Slave", but it might still be "Master" because its demote failed.
> So dl380g5d should be prevented from promoting, to avoid having multiple Masters.
> Pacemaker 1.1 seems to show the same behavior as 1.0.12.
> I'm not sure, but I think Pacemaker 1.0.11 behaves correctly (dl380g5d cannot promote).
> Please see the attached hb_report.
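
The condition described above (a failed demote on one node while another instance is reported as Master) can be checked mechanically in "crm_mon -1 -n" output. This is a minimal sketch, assuming the output format shown in this report; check_demote_risk is a hypothetical helper name, and it only inspects the text, it cannot tell what role the resource really has.

```shell
# Sketch only: flag a possible dual-Master situation in "crm_mon -1 -n" output.
# check_demote_risk is a hypothetical helper; it reads the report on stdin.
check_demote_risk() {
    awk '
        # Count demote entries under "Failed actions" whose rc is not 0.
        /_demote_0 .*rc=/ && !/rc=0[,)]/ { failed++ }
        # Count resource instances currently shown in the Master role.
        /\) Master/ { masters++ }
        END {
            if (failed > 0 && masters > 0) {
                print "RISK: demote failed but another instance is Master"
                exit 1
            }
            print "OK"
        }
    '
}
```

It would be used as "crm_mon -1 -n | check_demote_risk"; a non-zero exit status marks the situation reported here.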
>
>
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: determine_online_status: Node dl380g5c is standby
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: determine_online_status: Node dl380g5d is online
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: unpack_rsc_op: Operation dummy:0_monitor_0 found resource dummy:0 active in master mode on dl380g5c
> Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: unpack_rsc_op: Processing failed op dummy:0_demote_0 on dl380g5c: unknown error (1)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: unpack_rsc_op: Forcing dummy:0 to stop after a failed demote action
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: native_add_running: resource dummy:0 isnt managed
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: clone_print: Master/Slave Set: stateful
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: native_print: dummy:0 (ocf::pacemaker:Stateful): Slave dl380g5c (unmanaged) FAILED
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: short_print: Slaves: [ dl380g5d ]
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: get_failcount: stateful has failed 1 times on dl380g5c
> Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: common_apply_stickiness: Forcing stateful away from dl380g5c after 1 failures (max=1)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: get_failcount: stateful has failed 1 times on dl380g5c
> Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: common_apply_stickiness: Forcing stateful away from dl380g5c after 1 failures (max=1)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: native_color: Unmanaged resource dummy:0 allocated to 'nowhere': failed
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: master_color: Promoting dummy:1 (Slave dl380g5d)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: master_color: stateful: Promoted 1 instances of a possible 1 to master
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: RecurringOp: Start recurring monitor (10s) for dummy:1 on dl380g5d
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: RecurringOp: Start recurring monitor (10s) for dummy:1 on dl380g5d
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: LogActions: Leave resource dummy:0 (Slave unmanaged)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: LogActions: Promote dummy:1 (Slave -> Master dl380g5d)
>
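To follow the sequence in the log above (failed demote, forced stop, then promotion of the other instance) in a larger log file, the relevant pengine lines can be filtered out. This is only a sketch: pengine_demote_trail is a hypothetical helper name, and the patterns are taken from the messages quoted in this report, so other Pacemaker versions may word them differently.

```shell
# Sketch only: show the policy engine lines around a failed demote.
# pengine_demote_trail is a hypothetical helper; it reads a log on stdin.
pengine_demote_trail() {
    # Keep pengine messages only, then the failed-op, forced-stop,
    # promotion and LogActions lines among them.
    grep 'pengine:' |
        grep -E 'unpack_rsc_op: (Processing failed op|Forcing)|master_color: Promoting|LogActions:'
}
```

With the log file path in a variable of your choice, for example LOGFILE, this would be called as: pengine_demote_trail < "$LOGFILE"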
>
>
> Best Regards,
> Junko IKEDA
>
> NTT DATA INTELLILINK CORPORATION