[ClusterLabs] Antw: monitor operation for ASTERISK on node_name: 7 (not running)
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Nov 8 10:52:38 EST 2017
Hi!
It may be worth the time to find out which command is used to monitor the status of asterisk.
Then maybe run that command repeatedly in a shell loop to find out what it returns.
I guess some thing is slow to respond; maybe jsut increase the timeout...
Regards,
Ulrich
>>> Donat Zenichev <donat.zenichev at gmail.com> schrieb am 07.11.2017 um 17:07 in
Nachricht
<CANLwQCnVpPpM7s7kzH0P42WXyQRQH62Wm3VNZF8gEa0YuH09ew at mail.gmail.com>:
> Hi guys.
> I've just started to work with pacemaker and have a problem with monitored
> service.
>
> I've already configured three Active/Stand-by clusters with pacemaker.
> Running resources:
> IPaddr2
> asterisk daemon
> bacula fd
> snmp daemon
>
> First and second cluster are working fine - I didn't notice any failures.
> But the third cluster fails too frequently.
>
> Schema is similar for all clusters:
> Master (active) <-corosync-> Slave(Stand-by)
>
> There is no difference between three cluster, apart server loading.
> Asterisk that is running on third cluster has 500+ customers and processes
> much more calls than others.
>
> So that, cluster periodically thinks that asterisk is not running:
> * ASTERISK_monitor_2000 on node1-Master 'not running' (7): call=85,
> status=complete, exitreason='none',
> last-rc-change='Tue Nov 7 15:06:16 2017', queued=0ms, exec=0ms
>
> And restarts it (because on-fail=restart parameter for asterisk primitive).
> But indeed asterisk is working fine and nothing happens with him.
> I parsed asterisk full log and found nothing, that can explain the behavior
> of pacemaker.
>
> All machines are virtual (not containers, but proxmox VMs). They have
> enough resources, each has - 8 cores 3GHz, 8GB ram.
> I tried to increase resources on machines - I doubled them up, but it
> changed nothing.
> And it seemed to be that machine resources are not the root of the problem,
> resources monitoring showed that cores are not loaded more than 10%.
>
> Configurations.
>
> Corosync config:
> totem {
> version: 2
> cluster_name: asterisk
>
> token: 1000
> token_retransmit: 31
> hold: 31
>
> token_retransmits_before_loss_const: 0
>
> clear_node_high_bit: yes
>
> crypto_cipher: none
>
> crypto_hash: none
> rrp_mode: active
> transport: udpu
>
> interface {
> member {
> memberaddr: 10.100.1.1
> }
> member {
> memberaddr: 10.100.1.2
> }
>
> ringnumber: 0
> bindnetaddr: 10.100.1.1
>
> mcastport: 5405
> ttl: 1
> }
> }
>
> quorum {
> provider: corosync_votequorum
> expected_votes: 2
> }
>
> logging block is skipped.
>
>
>
> Pacemaker config:
>
> node 178676749: node1-Master
> node 178676750: node2-Slave
> primitive ASTERISK systemd:asterisk \
> op monitor interval=2s timeout=30s on-fail=restart \
> op start interval=0 timeout=30s \
> op stop interval=0 timeout=30s \
> meta migration-threshold=2 failure-timeout=1800s target-role=Started
> primitive BACULA systemd:bacula-fd \
> op monitor interval=30s timeout=60s on-fail=restart \
> op start interval=0 timeout=30s \
> op stop interval=0 timeout=30s \
> meta migration-threshold=2 failure-timeout=1800s
> primitive IPSHARED IPaddr2 \
> params ip=here.my.real.ip.address nic=ens18 cidr_netmask=29 \
> meta migration-threshold=2 target-role=Started \
> op monitor interval=20 timeout=60 on-fail=restart
> primitive SNMP systemd:snmpd \
> op monitor interval=30s timeout=60s on-fail=restart \
> op start interval=0 timeout=30s \
> op stop interval=0 timeout=30s \
> meta migration-threshold=2 failure-timeout=1800s target-role=Started
> order ASTERISK_AFTER_IPSHARED Mandatory: IPSHARED ASTERISK SNMP
> colocation ASTERISK_WITH_IPSHARED inf: ASTERISK IPSHARED
> location PREFER_BACULA BACULA 100: node1-Master
> location PREFER_MASTER ASTERISK 100: node1-Master
> location PREFER_SNMP SNMP 100: node1-Master
> property cib-bootstrap-options: \
> cluster-recheck-interval=5s \
> start-failure-is-fatal=false \
> stonith-enabled=false \
> no-quorum-policy=ignore \
> have-watchdog=false \
> dc-version=1.1.16-94ff4df \
> cluster-infrastructure=corosync \
> cluster-name=virtual2
>
>
> Asterisk systemd config:
>
> [Unit]
> Description=Asterisk
>
> [Service]
> ExecStart=/etc/init.d/asterisk start
> ExecStop=/etc/init.d/asterisk stop
> PIDFile=/var/run/asterisk/asterisk.pid
>
>
>
> Corosync log:
>
> Nov 07 15:06:16 [3958] node1-Master crmd: info:
> process_lrm_event: Result
> of monitor operation for ASTERISK on node1-Master: 7 (not running) |
> call=85 key=ASTERISK_monitor_2000 confirmed=false cib-update=106
> Nov 07 15:06:16 [3953] node1-Master cib: info:
> cib_process_request: Forwarding cib_modify operation for section status to
> all (origin=local/crmd/106)
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op:
> Diff:
> --- 0.38.37 2
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op:
> Diff:
> +++ 0.38.38 (null)
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: +
> /cib: @num_updates=38
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: +
> /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/l
> rm_resource[@id='ASTERISK']/lrm_rsc_op[@id='ASTERISK_last_failure_0']:
> @transition-key=2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934,
> @transition-magic=0:7;2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934,
> @call-id=85, @last-rc-change=1510059976
> Nov 07 15:06:16 [3953] node1-Master cib: info:
> cib_process_request: Completed cib_modify operation for section status: OK
> (rc=0, origin=node1-Master/crmd/106, version=0.38.38)
> Nov 07 15:06:16 [3956] node1-Master attrd: info:
> attrd_peer_update: Setting
> fail-count-ASTERISK[node1-Master]: 6 -> 7 from node2-Slave
> Nov 07 15:06:16 [3956] node1-Master attrd: info:
> attrd_peer_update: Setting
> last-failure-ASTERISK[node1-Master]: 1510059507 -> 1510059976 from
> node2-Slave
> Nov 07 15:06:16 [3955] node1-Master lrmd: info:
> cancel_recurring_action: Cancelling systemd operation SNMP_status_30000
> Nov 07 15:06:16 [3958] node1-Master crmd: info:
> do_lrm_rsc_op: Performing
> key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=SNMP_stop_0
> Nov 07 15:06:16 [3955] node1-Master lrmd: info: log_execute:
> executing
> - rsc:SNMP action:stop call_id:89
> Nov 07 15:06:16 [3958] node1-Master crmd: info:
> process_lrm_event: Result
> of monitor operation for SNMP on node1-Master: Cancelled | call=87
> key=SNMP_monitor_30000 confirmed=true
> Nov 07 15:06:16 [3955] node1-Master lrmd: info:
> systemd_exec_result: Call to stop passed:
> /org/freedesktop/systemd1/job/10916
> Nov 07 15:06:18 [3958] node1-Master crmd: notice:
> process_lrm_event: Result
> of stop operation for SNMP on node1-Master: 0 (ok) | call=89
> key=SNMP_stop_0 confirmed=true cib-update=107
> Nov 07 15:06:18 [3953] node1-Master cib: info:
> cib_process_request: Forwarding cib_modify operation for section status to
> all (origin=local/crmd/107)
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op:
> Diff:
> --- 0.38.38 2
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op:
> Diff:
> +++ 0.38.39 (null)
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: +
> /cib: @num_updates=39
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: +
> /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/l
> rm_resource[@id='SNMP']/lrm_rsc_op[@id='SNMP_last_0']:
> @operation_key=SNMP_stop_0, @operation=stop,
> @transition-key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934,
> @transition-magic=0:0;10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934,
> @call-id=89, @last-run=1510059976, @last-rc-change=1510059976,
> @exec-time=2047
> Nov 07 15:06:18 [3953] node1-Master cib: info:
> cib_process_request: Completed cib_modify operation for section status: OK
> (rc=0, origin=node1-Master/crmd/107, version=0.38.39)
> Nov 07 15:06:18 [3955] node1-Master lrmd: info:
> cancel_recurring_action: Cancelling systemd operation ASTERISK_status_2000
> Nov 07 15:06:18 [3958] node1-Master crmd: info:
> do_lrm_rsc_op: Performing
> key=3:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=ASTERISK_stop_0
> Nov 07 15:06:18 [3955] node1-Master lrmd: info: log_execute:
> executing
> - rsc:ASTERISK action:stop call_id:91
> Nov 07 15:06:18 [3958] node1-Master crmd: info:
> process_lrm_event: Result
> of monitor operation for ASTERISK on node1-Master: Cancelled | call=85
> key=ASTERISK_monitor_2000 confirmed=true
> Nov 07 15:06:18 [3955] node1-Master lrmd: info:
> systemd_exec_result: Call to stop passed:
> /org/freedesktop/systemd1/job/10917
>
>
>
> Asterisk with the same asterisk's configurations works fine on regular
> virtual machine (not cluster), with the same resource parameters.
> So I think, the problem consists of interaction between asterisk monitor
> (pacemaker) function and asterisk daemon. May be delays or something like.
>
> Thanks in advance for answers/hints.
>
>
> --
> --
> BR, Donat Zenichev
> Wnet VoIP team
> Tel: +380(44) 5-900-808
> http://wnet.ua
More information about the Users
mailing list