[ClusterLabs] systemd RA start/stop delays
Klaus Wenninger
kwenning at redhat.com
Thu Aug 18 15:58:12 UTC 2016
On 08/18/2016 04:00 PM, Ken Gaillot wrote:
> On 08/17/2016 08:17 PM, TEG AMJG wrote:
>> Hi
>>
>> I am having a problem with a simple Active/Passive cluster which
>> consists of the following configuration:
>>
>> Cluster Name: kamcluster
>> Corosync Nodes:
>> kam1vs3 kam2vs3
>> Pacemaker Nodes:
>> kam1vs3 kam2vs3
>>
>> Resources:
>> Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>> Attributes: ip=10.0.1.206 cidr_netmask=32
>> Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>> stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>> monitor interval=10s (ClusterIP-monitor-interval-10s)
>> Resource: ClusterIP2 (class=ocf provider=heartbeat type=IPaddr2)
>> Attributes: ip=10.0.1.207 cidr_netmask=32
>> Operations: start interval=0s timeout=20s (ClusterIP2-start-interval-0s)
>> stop interval=0s timeout=20s (ClusterIP2-stop-interval-0s)
>> monitor interval=10s (ClusterIP2-monitor-interval-10s)
>> Resource: rtpproxycluster (class=systemd type=rtpproxy)
>> Operations: monitor interval=10s (rtpproxycluster-monitor-interval-10s)
>> stop interval=0s on-fail=block (rtpproxycluster-stop-interval-0s)
>> Resource: kamailioetcfs (class=ocf provider=heartbeat type=Filesystem)
>> Attributes: device=/dev/drbd1 directory=/etc/kamailio fstype=ext4
>> Operations: start interval=0s timeout=60 (kamailioetcfs-start-interval-0s)
>> monitor interval=10s on-fail=fence (kamailioetcfs-monitor-interval-10s)
>> stop interval=0s on-fail=fence (kamailioetcfs-stop-interval-0s)
>> Clone: fence_kam2_xvm-clone
>> Meta Attrs: interleave=true clone-max=2 clone-node-max=1
>> Resource: fence_kam2_xvm (class=stonith type=fence_xvm)
>> Attributes: port=tegamjg_kam2 pcmk_host_list=kam2vs3
>> Operations: monitor interval=60s (fence_kam2_xvm-monitor-interval-60s)
>> Master: kamailioetcclone
>> Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true on-fail=fence
>> Resource: kamailioetc (class=ocf provider=linbit type=drbd)
>> Attributes: drbd_resource=kamailioetc
>> Operations: start interval=0s timeout=240 (kamailioetc-start-interval-0s)
>> promote interval=0s on-fail=fence (kamailioetc-promote-interval-0s)
>> demote interval=0s on-fail=fence (kamailioetc-demote-interval-0s)
>> stop interval=0s on-fail=fence (kamailioetc-stop-interval-0s)
>> monitor interval=10s (kamailioetc-monitor-interval-10s)
>> Clone: fence_kam1_xvm-clone
>> Meta Attrs: interleave=true clone-max=2 clone-node-max=1
>> Resource: fence_kam1_xvm (class=stonith type=fence_xvm)
>> Attributes: port=tegamjg_kam1 pcmk_host_list=kam1vs3
>> Operations: monitor interval=60s (fence_kam1_xvm-monitor-interval-60s)
>> Resource: kamailiocluster (class=ocf provider=heartbeat type=kamailio)
>> Attributes: listen_address=10.0.1.206 conffile=/etc/kamailio/kamailio.cfg
>> pidfile=/var/run/kamailio.pid monitoring_ip=10.0.1.206 monitoring_ip2=10.0.1.207
>> port=5060 proto=udp kamctlrc=/etc/kamailio/kamctlrc shmem=128 pkg=8
>> Meta Attrs: target-role=Stopped
>> Operations: start interval=0s timeout=60 (kamailiocluster-start-interval-0s)
>> stop interval=0s timeout=30 (kamailiocluster-stop-interval-0s)
>> monitor interval=5s (kamailiocluster-monitor-interval-5s)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>> start fence_kam1_xvm-clone then start fence_kam2_xvm-clone (kind:Mandatory) (id:order-fence_kam1_xvm-clone-fence_kam2_xvm-clone-mandatory)
>> start fence_kam2_xvm-clone then promote kamailioetcclone (kind:Mandatory) (id:order-fence_kam2_xvm-clone-kamailioetcclone-mandatory)
>> promote kamailioetcclone then start kamailioetcfs (kind:Optional) (id:order-kamailioetcclone-kamailioetcfs-Optional)
>> Resource Sets:
>> set kamailioetcfs sequential=true (id:pcs_rsc_set_kamailioetcfs)
>> set ClusterIP ClusterIP2 sequential=false (id:pcs_rsc_set_ClusterIP_ClusterIP2)
>> set rtpproxycluster kamailiocluster sequential=true (id:pcs_rsc_set_rtpproxycluster_kamailiocluster)
>> (id:pcs_rsc_order_set_kamailioetcfs_set_ClusterIP_ClusterIP2_set_rtpproxycluster_kamailiocluster)
>> Colocation Constraints:
>> rtpproxycluster with ClusterIP2 (score:INFINITY) (id:colocation-rtpproxycluster-ClusterIP2-INFINITY)
>> ClusterIP2 with ClusterIP (score:INFINITY) (id:colocation-ClusterIP2-ClusterIP-INFINITY)
>> ClusterIP with kamailioetcfs (score:INFINITY) (id:colocation-ClusterIP-kamailioetcfs-INFINITY)
>> kamailioetcfs with kamailioetcclone (score:INFINITY) (with-rsc-role:Master) (id:colocation-kamailioetcfs-kamailioetcclone-INFINITY)
>> fence_kam2_xvm-clone with fence_kam1_xvm-clone (score:INFINITY) (id:colocation-fence_kam2_xvm-clone-fence_kam1_xvm-clone-INFINITY)
>> kamailioetcclone with fence_kam2_xvm-clone (score:INFINITY) (id:colocation-kamailioetcclone-fence_kam2_xvm-clone-INFINITY)
>> kamailiocluster with rtpproxycluster (score:INFINITY) (id:colocation-kamailiocluster-rtpproxycluster-INFINITY)
>>
>> Resources Defaults:
>> migration-threshold: 2
>> failure-timeout: 10m
>> resource-stickiness: 200
>> Operations Defaults:
>> No defaults set
>>
>> Cluster Properties:
>> cluster-infrastructure: corosync
>> cluster-name: kamcluster
>> dc-version: 1.1.13-10.el7_2.2-44eb2dd
>> have-watchdog: false
>> last-lrm-refresh: 1471479940
>> no-quorum-policy: ignore
>> start-failure-is-fatal: false
>> stonith-action: reboot
>> stonith-enabled: true
>>
>> Now my problem is that when the rtpproxy systemd resource starts/stops, I am
>> experiencing a 2-second delay between the moment the start/stop request
>> is sent to systemd and the confirmation and interpretation of the action
>> by the crmd. I am pretty sure that the initiation of this service doesn't
> The problem is that systemd does not have a way to say "start this
> service, and don't tell me it's done until it's completely started".
> Systemd reports successful completion as soon as the start has been
> initiated.
First, sorry for my previous guess in the wrong direction ...
But anyway, with 'systemctl' the waiting for a service to daemonize
(Type=forking) or to report readiness via D-Bus (Type=notify)
seems to work.
I haven't checked yet whether this is a useful source of information,
but this is how 'systemctl' appears to do the waiting:
https://github.com/systemd/systemd/blob/master/src/systemctl/systemctl.c#L2838
https://github.com/systemd/systemd/blob/master/src/shared/bus-unit-util.c#L839
https://github.com/systemd/systemd/blob/master/src/shared/bus-unit-util.c#L788
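Roughly, systemctl subscribes to the Manager's JobRemoved signal before
issuing StartUnit, and then blocks until the job path returned by
StartUnit shows up there. A rough, untested sketch of that idea with the
python dbus bindings (the unit name is just the one from this thread):

# Sketch of the systemctl-style wait: subscribe to JobRemoved,
# queue the start job, then block until *our* job is removed.
# Assumes dbus-python plus a GLib main loop; untested.
import dbus
from dbus.mainloop.glib import DBusGMainLoop
from gi.repository import GLib

DBusGMainLoop(set_as_default=True)
bus = dbus.SystemBus()
manager = dbus.Interface(
    bus.get_object('org.freedesktop.systemd1', '/org/freedesktop/systemd1'),
    'org.freedesktop.systemd1.Manager')

loop = GLib.MainLoop()
manager.Subscribe()                 # ask systemd to emit signals

def on_job_removed(job_id, job_path, unit, result):
    if job_path == pending_job:     # only react to the job we queued
        print('job for %s finished: %s' % (unit, result))
        loop.quit()

manager.connect_to_signal('JobRemoved', on_job_removed)

# StartUnit returns as soon as the job is queued, not when the
# service is up; completion only arrives via the JobRemoved signal.
pending_job = manager.StartUnit('rtpproxy.service', 'replace')
loop.run()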
I know that the status query for certain daemons would fail if issued
directly after they have daemonized. And of course the concept
can't work with all unit types (e.g. Type=simple).
Thus the status checking after StartUnit could still be done with a
configurable delay (via an attribute defaulting to the 2 seconds).
>
> By contrast, Pacemaker can't proceed with dependencies until the service
> is completely up. So, Pacemaker's workaround is to initiate the start,
> then poll the status every 2 seconds until systemd reports the service
> is completely up.
>
> The 2 seconds is hardcoded, so there's no way to avoid that at the
> moment, other than using an OCF agent.
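Right - and to make the effect concrete, what happens under the hood
amounts to something like the sketch below (again python dbus bindings,
purely illustrative and untested; in Pacemaker the interval is hardcoded
to 2 seconds, made a parameter here just to show where a configurable
delay would plug in):

# Sketch of the Pacemaker-style workaround described above: queue the
# start, then poll the unit's ActiveState until it reports 'active'.
import time
import dbus

def start_and_poll(unit, interval=2.0, timeout=20.0):
    bus = dbus.SystemBus()
    manager = dbus.Interface(
        bus.get_object('org.freedesktop.systemd1',
                       '/org/freedesktop/systemd1'),
        'org.freedesktop.systemd1.Manager')

    manager.StartUnit(unit, 'replace')   # returns once the job is queued

    props = dbus.Interface(
        bus.get_object('org.freedesktop.systemd1', manager.GetUnit(unit)),
        'org.freedesktop.DBus.Properties')

    deadline = time.time() + timeout
    while time.time() < deadline:
        state = props.Get('org.freedesktop.systemd1.Unit', 'ActiveState')
        if state == 'active':
            return True
        # even a service that is up almost instantly pays for up to one
        # full interval here - which matches the ~2s exec-time in the
        # logs below
        time.sleep(interval)
    return False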
>
>> take even half of that time to start/stop. I did the same thing with a
>> dummy service as a systemd resource and it also takes 2 seconds to
>> start/stop the resource. Also, this is not happening with any of the OCF
>> resources. Below is what I see in the logs about this behaviour.
>>
>> This is after doing "pcs resource restart rtpproxycluster"
>>
>> Aug 17 20:59:18 [13187] kam1 crmd: notice: te_rsc_command:
>> Initiating action 14: stop rtpproxycluster_stop_0 on kam1vs3 (local)
>> Aug 17 20:59:18 [13184] kam1 lrmd: info:
>> cancel_recurring_action: Cancelling systemd operation
>> rtpproxycluster_status_10000
>> Aug 17 20:59:18 [13187] kam1 crmd: info: do_lrm_rsc_op:
>> Performing key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
>> op=rtpproxycluster_stop_0
>> Aug 17 20:59:18 [13184] kam1 lrmd: info: log_execute:
>> executing - rsc:rtpproxycluster action:stop call_id:106
>> Aug 17 20:59:18 [13187] kam1 crmd: info: process_lrm_event:
>> Operation rtpproxycluster_monitor_10000: Cancelled (node=kam1vs3,
>> call=104, confirmed=true)
>> Aug 17 20:59:18 [13184] kam1 lrmd: info: systemd_exec_result:
>> Call to stop passed: /org/freedesktop/systemd1/job/8302
>> Aug 17 20:59:20 [13187] kam1 crmd: notice: process_lrm_event:
>> Operation rtpproxycluster_stop_0: ok (node=kam1vs3, call=106, rc=0,
>> cib-update=134, confirmed=true)
>> Aug 17 20:59:20 [13182] kam1 cib: info: cib_perform_op:
>> +
>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
>> @operation_key=rtpproxycluster_stop_0, @operation=stop,
>> @transition-key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>> @transition-magic=0:0;14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>> @call-id=106, @last-run=1471481958, @last-rc-change=1471481958,
>> @exec-time=2116
>>
>> Aug 17 20:59:20 [13186] kam1 pengine: info: RecurringOp: Start
>> recurring monitor (10s) for rtpproxycluster on kam1vs3
>> Aug 17 20:59:20 [13186] kam1 pengine: notice: LogActions: Start
>> rtpproxycluster (kam1vs3)
>> Aug 17 20:59:20 [13187] kam1 crmd: notice: te_rsc_command:
>> Initiating action 13: start rtpproxycluster_start_0 on kam1vs3 (local)
>> Aug 17 20:59:20 [13187] kam1 crmd: info: do_lrm_rsc_op:
>> Performing key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
>> op=rtpproxycluster_start_0
>> Aug 17 20:59:20 [13184] kam1 lrmd: info: log_execute:
>> executing - rsc:rtpproxycluster action:start call_id:107
>> Aug 17 20:59:21 [13184] kam1 lrmd: info: systemd_exec_result:
>> Call to start passed: /org/freedesktop/systemd1/job/8303
>> Aug 17 20:59:23 [13187] kam1 crmd: notice: process_lrm_event:
>> Operation rtpproxycluster_start_0: ok (node=kam1vs3, call=107, rc=0,
>> cib-update=136, confirmed=true)
>> Aug 17 20:59:23 [13182] kam1 cib: info: cib_perform_op:
>> +
>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
>> @operation_key=rtpproxycluster_start_0, @operation=start,
>> @transition-key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>> @transition-magic=0:0;13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>> @call-id=107, @last-run=1471481960, @last-rc-change=1471481960,
>> @exec-time=2068
>>
>> Why is this happening? Many of my resources depend on each other, so
>> the delay when a failover is needed adds up quite a lot (4 seconds of
>> delay stopping/starting just for that one systemd resource).
>>
>> Is this normal? Is there any way to solve this? Do I really need to
>> make an OCF resource for that service (which would take some time that
>> I don't really have at this moment)?
>>
>> Regards
>>
>> Alejandro
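As for the OCF question above: such an agent does not have to be big.
Anything honoring the start/stop/monitor/meta-data contract and the OCF
exit codes will do. A rough, untested sketch of a wrapper agent (Python
just for illustration - real agents are usually shell scripts; the unit
name is hardcoded and the meta-data kept minimal):

#!/usr/bin/env python
# Rough, untested sketch of a minimal OCF agent wrapping a systemd
# unit so that 'start' only returns once the service is really up.
import subprocess
import sys
import time

UNIT = 'rtpproxy.service'
OCF_SUCCESS, OCF_ERR_GENERIC, OCF_ERR_UNIMPLEMENTED, OCF_NOT_RUNNING = 0, 1, 3, 7

METADATA = '''<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="rtpproxy-wrapper" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Sketch: blocks in start until the unit is active</longdesc>
  <shortdesc lang="en">rtpproxy systemd wrapper</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>'''

def is_active():
    return subprocess.call(['systemctl', 'is-active', '--quiet', UNIT]) == 0

def monitor():
    return OCF_SUCCESS if is_active() else OCF_NOT_RUNNING

def start():
    # 'systemctl start' already waits for the job itself; the extra
    # poll guards Type=simple units where job completion does not
    # mean the daemon is actually ready.
    subprocess.call(['systemctl', 'start', UNIT])
    for _ in range(40):        # poll faster than Pacemaker's fixed 2s
        if is_active():
            return OCF_SUCCESS
        time.sleep(0.25)
    return OCF_ERR_GENERIC

def stop():
    subprocess.call(['systemctl', 'stop', UNIT])
    return OCF_SUCCESS if not is_active() else OCF_ERR_GENERIC

if __name__ == '__main__':
    action = sys.argv[1] if len(sys.argv) > 1 else ''
    if action == 'meta-data':
        print(METADATA)
        sys.exit(OCF_SUCCESS)
    dispatch = {'start': start, 'stop': stop,
                'monitor': monitor, 'status': monitor}
    sys.exit(dispatch[action]() if action in dispatch
             else OCF_ERR_UNIMPLEMENTED)

Dropped into /usr/lib/ocf/resource.d/ under some provider directory and
made executable, something along these lines could be configured like
any other OCF resource.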
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org