[Pacemaker] pacemaker shutdown waits for a failover
Andrew Beekhof
andrew at beekhof.net
Thu Jul 31 23:09:35 CEST 2014
On 31 Jul 2014, at 8:20 pm, Liron Amitzi <LironA at imperva.com> wrote:
>>> When I run "service pacemaker stop" it takes a long time. I see that it stops all the resources, then starts them on the other node, and only then does the "stop" command complete.
>>
>> Ahhh! It was the DC.
>>
>> It appears to be deliberate. I found this commit from 2008 where the behaviour was introduced:
>> https://github.com/beekhof/pacemaker/commit/7bf55f0
>>
>> I could change it, but I'm no longer sure this would be a good idea, as it would increase service downtime.
>> (Electing and bootstrapping a new DC introduces additional delays before the cluster can bring up any resources.)
>>
>> I assume there is a particular resource that takes a long time to start?
>>
> Yes, mainly the JavaSrv takes quite a lot of time...
Do you have any resources that need to start after JavaSrv?
If not there might be some magic you can use...
> So you are saying this is by design, since the server I'm rebooting is the DC, and I suffer because my resources take a long time to start?
Essentially, yes.
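
If you want to know ahead of time whether the node you are about to stop is the DC, the one-shot status output of crm_mon includes a "Current DC:" line. Here is a minimal Python sketch, assuming that line has the form "Current DC: <node> ..." (the exact wording differs between Pacemaker versions, and the helper names are made up for illustration):

```python
import subprocess


def current_dc(status_text):
    """Return the node named on the 'Current DC:' line, or None if absent.

    Assumes crm_mon-style text where the node name is the first token
    after 'Current DC:'. This format is an assumption; check it against
    the output of your own Pacemaker version.
    """
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("Current DC:"):
            tokens = line[len("Current DC:"):].split()
            return tokens[0] if tokens else None
    return None


def node_is_dc(node_name):
    """Run 'crm_mon -1' (one-shot status) and compare the DC to node_name."""
    result = subprocess.run(
        ["crm_mon", "-1"], capture_output=True, text=True, check=True
    )
    return current_dc(result.stdout) == node_name
```

Calling node_is_dc("ha1") on a cluster node before "service pacemaker stop" would tell you whether to expect the stop to wait for the resources to come up elsewhere.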
> Got it, thanks a lot for your response.
>
>>
>>> I have 3 resources, IP, OracleDB and JavaSrv
>>>
>>> This is the output on the screen:
>>> [root@ha1 ~]# service pacemaker stop
>>> Signaling Pacemaker Cluster Manager to terminate: [ OK ]
>>> Waiting for cluster services to unload:..........[...].......... [ OK ]
>>> [root@ha1 ~]#
>>>
>>> And these are parts of the log (/var/log/cluster/corosync.log):
>>> Jun 29 15:14:15 [28031] ha1 pengine: notice: stage6: Scheduling Node ha1 for shutdown
>>> Jun 29 15:14:15 [28031] ha1 pengine: notice: LogActions: Move ip_resource (Started ha1 -> ha2)
>>> Jun 29 15:14:15 [28031] ha1 pengine: notice: LogActions: Move OracleDB (Started ha1 -> ha2)
>>> Jun 29 15:14:15 [28031] ha1 pengine: notice: LogActions: Move JavaSrv (Started ha1 -> ha2)
>>> Jun 29 15:14:15 [28032] ha1 crmd: info: te_rsc_command: Initiating action 12: stop JavaSrv_stop_0 on ha1 (local)
>>> Jun 29 15:14:15 ha1 lrmd: [28029]: info: rsc:JavaSrv:16: stop
>>> ...
>>> Jun 29 15:14:41 [28032] ha1 crmd: info: process_lrm_event: LRM operation JavaSrv_stop_0 (call=16, rc=0, cib-update=447, confirmed=true) ok
>>> Jun 29 15:14:41 [28032] ha1 crmd: info: te_rsc_command: Initiating action 9: stop OracleDB_stop_0 on ha1 (local)
>>> Jun 29 15:14:41 ha1 lrmd: [28029]: info: cancel_op: operation monitor[13] on lsb::ha-dbora::OracleDB for client 28032, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_timeout=[600000] CRM_meta_interval=[60000] cancelled
>>> Jun 29 15:14:41 ha1 lrmd: [28029]: info: rsc:OracleDB:17: stop
>>> ...
>>> Jun 29 15:15:08 [28032] ha1 crmd: info: process_lrm_event: LRM operation OracleDB_stop_0 (call=17, rc=0, cib-update=448, confirmed=true) ok
>>> Jun 29 15:15:08 [28032] ha1 crmd: info: te_rsc_command: Initiating action 7: stop ip_resource_stop_0 on ha1 (local)
>>> ...
>>> Jun 29 15:15:08 [28032] ha1 crmd: info: process_lrm_event: LRM operation ip_resource_stop_0 (call=18, rc=0, cib-update=449, confirmed=true) ok
>>> Jun 29 15:15:08 [28032] ha1 crmd: info: te_rsc_command: Initiating action 8: start ip_resource_start_0 on ha2
>>> Jun 29 15:15:08 [28032] ha1 crmd: info: te_crm_command: Executing crm-event (21): do_shutdown on ha1
>>> Jun 29 15:15:08 [28032] ha1 crmd: info: te_crm_command: crm-event (21) is a local shutdown
>>> Jun 29 15:15:09 [28032] ha1 crmd: info: te_rsc_command: Initiating action 10: start OracleDB_start_0 on ha2
>>> Jun 29 15:15:51 [28032] ha1 crmd: info: te_rsc_command: Initiating action 11: monitor OracleDB_monitor_60000 on ha2
>>> Jun 29 15:15:51 [28032] ha1 crmd: info: te_rsc_command: Initiating action 13: start JavaSrv_start_0 on ha2
>>> ...
>>> Jun 29 15:27:09 [28023] ha1 pacemakerd: info: pcmk_child_exit: Child process cib exited (pid=28027, rc=0)
>>> Jun 29 15:27:09 [28023] ha1 pacemakerd: notice: pcmk_shutdown_worker: Shutdown complete
>>> Jun 29 15:27:09 [28023] ha1 pacemakerd: info: main: Exiting pacemakerd
>>>
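
The timestamps in the log excerpt above show where the time goes. As a rough sketch, the following Python takes four milestone lines copied from that excerpt and computes the interval between each pair. The function names and phase labels are made up for illustration, and the year is a placeholder, since syslog-style timestamps carry none.

```python
from datetime import datetime

# Milestone lines copied verbatim from the log excerpt above.
LOG = """\
Jun 29 15:14:15 [28031] ha1 pengine: notice: stage6: Scheduling Node ha1 for shutdown
Jun 29 15:15:08 [28032] ha1 crmd: info: process_lrm_event: LRM operation ip_resource_stop_0 (call=18, rc=0, cib-update=449, confirmed=true) ok
Jun 29 15:15:51 [28032] ha1 crmd: info: te_rsc_command: Initiating action 13: start JavaSrv_start_0 on ha2
Jun 29 15:27:09 [28023] ha1 pacemakerd: notice: pcmk_shutdown_worker: Shutdown complete
"""


def stamp(line, year=2014):
    # The first 15 characters are "Mon DD HH:MM:SS". The year is a
    # placeholder so strptime gets a full date; only differences are used.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_with(marker):
    # Timestamp of the first milestone line containing the marker text.
    return next(stamp(l) for l in LOG.splitlines() if marker in l)


def phase_seconds():
    t_sched = first_with("Scheduling Node ha1 for shutdown")
    t_stopped = first_with("LRM operation ip_resource_stop_0")
    t_java = first_with("start JavaSrv_start_0 on ha2")
    t_done = first_with("Shutdown complete")
    return {
        "stop resources on ha1": int((t_stopped - t_sched).total_seconds()),
        "start ip_resource and OracleDB on ha2": int((t_java - t_stopped).total_seconds()),
        "JavaSrv start initiated until shutdown complete": int((t_done - t_java).total_seconds()),
    }


if __name__ == "__main__":
    for phase, secs in phase_seconds().items():
        print(f"{phase}: {secs}s")
```

On these timestamps, stopping the three resources on ha1 takes 53 seconds, bringing up ip_resource and OracleDB on ha2 takes another 43 seconds, and 678 seconds (over 11 minutes) pass between initiating the JavaSrv start and pacemakerd reporting "Shutdown complete". The log is elided in that last interval, so this shows only that most of the wait falls after the JavaSrv start was initiated, not exactly what happened during it.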