[ClusterLabs] systemd RA start/stop delays
Ken Gaillot
kgaillot at redhat.com
Thu Aug 18 16:00:24 CEST 2016
On 08/17/2016 08:17 PM, TEG AMJG wrote:
> Hi
>
> I am having a problem with a simple Active/Passive cluster which
> consists in the next configuration
>
> Cluster Name: kamcluster
> Corosync Nodes:
> kam1vs3 kam2vs3
> Pacemaker Nodes:
> kam1vs3 kam2vs3
>
> Resources:
> Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: ip=10.0.1.206 cidr_netmask=32
> Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> monitor interval=10s (ClusterIP-monitor-interval-10s)
> Resource: ClusterIP2 (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: ip=10.0.1.207 cidr_netmask=32
> Operations: start interval=0s timeout=20s (ClusterIP2-start-interval-0s)
> stop interval=0s timeout=20s (ClusterIP2-stop-interval-0s)
> monitor interval=10s (ClusterIP2-monitor-interval-10s)
> Resource: rtpproxycluster (class=systemd type=rtpproxy)
> Operations: monitor interval=10s (rtpproxycluster-monitor-interval-10s)
> stop interval=0s on-fail=block
> (rtpproxycluster-stop-interval-0s)
> Resource: kamailioetcfs (class=ocf provider=heartbeat type=Filesystem)
> Attributes: device=/dev/drbd1 directory=/etc/kamailio fstype=ext4
> Operations: start interval=0s timeout=60 (kamailioetcfs-start-interval-0s)
> monitor interval=10s on-fail=fence
> (kamailioetcfs-monitor-interval-1
> 0s)
> stop interval=0s on-fail=fence
> (kamailioetcfs-stop-interval-0s)
> Clone: fence_kam2_xvm-clone
> Meta Attrs: interleave=true clone-max=2 clone-node-max=1
> Resource: fence_kam2_xvm (class=stonith type=fence_xvm)
> Attributes: port=tegamjg_kam2 pcmk_host_list=kam2vs3
> Operations: monitor interval=60s (fence_kam2_xvm-monitor-interval-60s)
> Master: kamailioetcclone
> Meta Attrs: master-max=1 master-node-max=1 clone-max=2
> clone-node-max=1 notify=t
> rue on-fail=fence
> Resource: kamailioetc (class=ocf provider=linbit type=drbd)
> Attributes: drbd_resource=kamailioetc
> Operations: start interval=0s timeout=240 (kamailioetc-start-interval-0s)
> promote interval=0s on-fail=fence
> (kamailioetc-promote-interval-0s)
> demote interval=0s on-fail=fence
> (kamailioetc-demote-interval-0s)
> stop interval=0s on-fail=fence (kamailioetc-stop-interval-0s)
> monitor interval=10s (kamailioetc-monitor-interval-10s)
> Clone: fence_kam1_xvm-clone
> Meta Attrs: interleave=true clone-max=2 clone-node-max=1
> Resource: fence_kam1_xvm (class=stonith type=fence_xvm)
> Attributes: port=tegamjg_kam1 pcmk_host_list=kam1vs3
> Operations: monitor interval=60s (fence_kam1_xvm-monitor-interval-60s)
> Resource: kamailiocluster (class=ocf provider=heartbeat type=kamailio)
> Attributes: listen_address=10.0.1.206
> conffile=/etc/kamailio/kamailio.cfg pidfil
>
> e=/var/run/kamailio.pid monitoring_ip=10.0.1.206
> monitoring_ip2=10.0.1.207 port=50
> 60 proto=udp
> kamctlrc=/etc/kamailio/kamctlrc shmem=128 pkg=8
> Meta Attrs: target-role=Stopped
> Operations: start interval=0s timeout=60
> (kamailiocluster-start-interval-0s)
> stop interval=0s timeout=30 (kamailiocluster-stop-interval-0s)
> monitor interval=5s (kamailiocluster-monitor-interval-5s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
> start fence_kam1_xvm-clone then start fence_kam2_xvm-clone
> (kind:Mandatory) (id:
>
> order-fence_kam1_xvm-clone-fence_kam2_xvm-clone-mandatory)
> start fence_kam2_xvm-clone then promote kamailioetcclone
> (kind:Mandatory) (id:or
>
> der-fence_kam2_xvm-clone-kamailioetcclone-mandatory)
> promote kamailioetcclone then start kamailioetcfs (kind:Optional)
> (id:order-kama
> ilioetcclone-kamailioetcfs-Optional)
> Resource Sets:
> set kamailioetcfs sequential=true (id:pcs_rsc_set_kamailioetcfs) set
> ClusterIP
> ClusterIP2 sequential=false
> (id:pcs_rsc_set_ClusterIP_ClusterIP2) set rtpproxyclu
> ster
> kamailiocluster sequential=true
> (id:pcs_rsc_set_rtpproxycluster_kamailioclust
> er)
> (id:pcs_rsc_order_set_kamailioetcfs_set_ClusterIP_ClusterIP2_set_rtpproxyclust
>
> er_kamailiocluster)
> Colocation Constraints:
> rtpproxycluster with ClusterIP2 (score:INFINITY)
> (id:colocation-rtpproxycluster-
> ClusterIP2-INFINITY)
> ClusterIP2 with ClusterIP (score:INFINITY)
> (id:colocation-ClusterIP2-ClusterIP-I
> NFINITY)
> ClusterIP with kamailioetcfs (score:INFINITY)
> (id:colocation-ClusterIP-kamailioe
> tcfs-INFINITY)
> kamailioetcfs with kamailioetcclone (score:INFINITY)
> (with-rsc-role:Master) (id:
>
> colocation-kamailioetcfs-kamailioetcclone-INFINITY)
> fence_kam2_xvm-clone with fence_kam1_xvm-clone (score:INFINITY)
> (id:colocation-f
>
> ence_kam2_xvm-clone-fence_kam1_xvm-clone-INFINITY)
> kamailioetcclone with fence_kam2_xvm-clone (score:INFINITY)
> (id:colocation-kamai
> lioetcclone-fence_kam2_xvm-clone-INFINITY)
> kamailiocluster with rtpproxycluster (score:INFINITY)
> (id:colocation-kamailioclu
> ster-rtpproxycluster-INFINITY)
>
> Resources Defaults:
> migration-threshold: 2
> failure-timeout: 10m
> resource-stickiness: 200
> Operations Defaults:
> No defaults set
>
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: kamcluster
> dc-version: 1.1.13-10.el7_2.2-44eb2dd
> have-watchdog: false
> last-lrm-refresh: 1471479940
> no-quorum-policy: ignore
> start-failure-is-fatal: false
> stonith-action: reboot
> stonith-enabled: true
>
> Now my problem is when the rtpproxy systemd resource starts/stops, i am
> experiencing 2 seconds delay between the call from the start/stop signal
> is sent to systemd and the confirmation and interpretation of the action
> to the crmd. I am pretty sure that the initiation of this service doesnt
The problem is that systemd does not have a way to say "start this
service, and don't tell me it's done until it's completely started".
Systemd reports successful completion as soon as the start has been
initiated.
By contrast, Pacemaker can't proceed with dependencies until the service
is completely up. So, Pacemaker's workaround is to initiate the start,
then poll the status every 2 seconds until systemd reports the service
is completely up.
The 2 seconds is hardcoded, so there's no way to avoid that at the
moment, other than using an OCF agent.
> take even half of that time to start/stop. i did the same thing with a
> dummy service as a systemd resource and it als takes 2 seconds to
> start/stop the resource. Also, this is not happening in any of the ocf
> resources. I am putting what i see in the logs about this behaviour
>
> This is after doing "pcs resource restart rtpproxycluster"
>
> Aug 17 20:59:18 [13187] kam1 crmd: notice: te_rsc_command:
> Initiating action 14: stop rtpproxycluster_stop_0 on kam1vs3 (local)
> Aug 17 20:59:18 [13184] kam1 lrmd: info:
> cancel_recurring_action: Cancelling systemd operation
> rtpproxycluster_status_10000
> Aug 17 20:59:18 [13187] kam1 crmd: info: do_lrm_rsc_op:
> Performing key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
> op=rtpproxycluster_stop_0
> Aug 17 20:59:18 [13184] kam1 lrmd: info: log_execute:
> executing - rsc:rtpproxycluster action:stop call_id:106
> Aug 17 20:59:18 [13187] kam1 crmd: info: process_lrm_event:
> Operation rtpproxycluster_monitor_10000: Cancelled (node=kam1vs3,
> call=104, confirmed=true)
> Aug 17 20:59:18 [13184] kam1 lrmd: info: systemd_exec_result:
> Call to stop passed: /org/freedesktop/systemd1/job/8302
> Aug 17 20:59:20 [13187] kam1 crmd: notice: process_lrm_event:
> Operation rtpproxycluster_stop_0: ok (node=kam1vs3, call=106, rc=0,
> cib-update=134, confirmed=true)
> Aug 17 20:59:20 [13182] kam1 cib: info: cib_perform_op:
> +
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
> @operation_key=rtpproxycluster_stop_0, @operation=stop,
> @transition-key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> @transition-magic=0:0;14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> @call-id=106, @last-run=1471481958, @last-rc-change=1471481958,
> @exec-time=2116
>
> Aug 17 20:59:20 [13186] kam1 pengine: info: RecurringOp: Start
> recurring monitor (10s) for rtpproxycluster on kam1vs3
> Aug 17 20:59:20 [13186] kam1 pengine: notice: LogActions: Start
> rtpproxycluster (kam1vs3)
> Aug 17 20:59:20 [13187] kam1 crmd: notice: te_rsc_command:
> Initiating action 13: start rtpproxycluster_start_0 on kam1vs3 (local)
> Aug 17 20:59:20 [13187] kam1 crmd: info: do_lrm_rsc_op:
> Performing key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
> op=rtpproxycluster_start_0
> Aug 17 20:59:20 [13184] kam1 lrmd: info: log_execute:
> executing - rsc:rtpproxycluster action:start call_id:107
> Aug 17 20:59:21 [13184] kam1 lrmd: info: systemd_exec_result:
> Call to start passed: /org/freedesktop/systemd1/job/8303
> Aug 17 20:59:23 [13187] kam1 crmd: notice: process_lrm_event:
> Operation rtpproxycluster_start_0: ok (node=kam1vs3, call=107, rc=0,
> cib-update=136, confirmed=true)
> Aug 17 20:59:23 [13182] kam1 cib: info: cib_perform_op:
> +
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
> @operation_key=rtpproxycluster_start_0, @operation=start,
> @transition-key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> @transition-magic=0:0;13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> @call-id=107, @last-run=1471481960, @last-rc-change=1471481960,
> @exec-time=2068
>
> Why is this happening?, many of my resources depends on each other, so
> the delay it takes when failover its needed is quite a lot (just 4
> seconds of delay stopping/starting just for that systemd resource.
>
> Is this normal?, is there anyway to solve this?, Do i really need to
> make an OCF resource for that service (that would take some time that i
> dont really have at this moment)
>
> Regards
>
> Alejandro
More information about the Users
mailing list