[Pacemaker] ClusterMon Resource starting multiple instances of crm_mon
Steven Bambling
smbambling at arin.net
Fri May 10 10:08:18 UTC 2013
On May 10, 2013, at 5:35 AM, Steven Bambling <smbambling at arin.net> wrote:
>
> On May 9, 2013, at 8:05 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>>
>> On 10/05/2013, at 12:40 AM, Steven Bambling <smbambling at arin.net> wrote:
>>
>>> I'm having some issues getting cluster monitoring set up and configured on a 3-node multi-state cluster. I'm using Florian's blog as an example: http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/.
>>>
>>> When I create the primitive resource, it starts on one of my nodes but spawns multiple instances of crm_mon. I don't see any reason that would cause it to spawn multiple instances; it's very odd behavior.
>>
>> If you run:
>>
>> /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>
>> manually a few times, what happens? Multiple processes?
>
> Yep, for some reason it's spawning multiple processes.
>
> [root at pgdb3 ~]# ps aux | grep crm_mon
> root 30678 0.0 0.0 103244 856 pts/0 S+ 05:30 0:00 grep crm_mon
> [root at pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root at pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root at pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root at pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root at pgdb3 ~]# ps aux | grep crm_mon
> root 30772 0.0 0.0 82744 2816 ? S 05:30 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> root 30781 0.0 0.0 82744 2668 ? S 05:30 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> root 30784 0.0 0.0 82744 2476 ? S 05:30 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> root 31134 0.0 0.0 103244 856 pts/0 S+ 05:30 0:00 grep crm_mon
>
> But the .pid file in the /tmp dir only lists one PID:
> [root at pgdb3 ~]# cat /tmp/ClusterMon_SNMPMon.pid
> 30772
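> A quick way to double-check that the PID it records is still alive is a throwaway one-liner against the same pidfile, something like:
>
> [root at pgdb3 ~]# kill -0 "$(cat /tmp/ClusterMon_SNMPMon.pid)" && echo "pidfile process is alive"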
I take that back. I double-checked, and the SNMPMon resource was still started, which was what was creating the multiple processes. After I stopped the resource I pkill'd all the crm_mon processes and then ran the command again manually. Now it seems to squash the additional processes and only allows one process to run.
[root at pgdb3 tmp]# ps aux | grep crm_mon
root 30955 0.0 0.0 82492 2632 pts/0 S 06:05 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
root 31991 0.0 0.0 103244 852 pts/0 S+ 06:05 0:00 grep crm_mon
[root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
[root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
[root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
[root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
[root at pgdb3 tmp]# ps aux | grep crm_mon
root 30955 0.0 0.0 82492 2632 pts/0 S 06:05 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
root 32545 0.0 0.0 103244 856 pts/0 S+ 06:06 0:00 grep crm_mon
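For anyone hitting the same thing, the cleanup that got it back to a single process was roughly the following (a sketch; the resource and file names are the ones from the configuration earlier in this thread):

[root at pgdb3 tmp]# crm resource stop SNMPMon     # stop the ClusterMon resource so Pacemaker stops respawning crm_mon
[root at pgdb3 tmp]# pkill crm_mon                 # kill any leftover crm_mon daemons
[root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
[root at pgdb3 tmp]# ps aux | grep crm_mon         # verify only the one daemon remains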
STEVE
>
>>
>>>
>>> I was also looking for some clarification on what this resource provides. It looks to me like it kicks off crm_mon in daemon mode, which updates an .html file and, with -E, runs an external script. But the resource itself doesn't trigger anything when another resource changes state; it only reacts if the crm_mon process (monitored via its PID) fails and has to be restarted.
>>
>> Correct, it just updates the HTML file, which you can see in your browser.
>> Or, with -E, it can send an email or an SNMP alert.
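>> To illustrate the -E part, a minimal external agent could look something like the sketch below (not the actual pcmk_snmp_helper.sh). crm_mon exports the details of each event in CRM_notify_* environment variables, and the script just acts on them; here a logger call stands in for the real snmptrap or mail call:
>>
>> #!/bin/sh
>> # Minimal crm_mon -E external agent sketch.
>> # crm_mon sets these variables for every event it reports.
>> rsc="${CRM_notify_rsc}"              # resource the event applies to
>> task="${CRM_notify_task}"            # operation, e.g. start, stop, monitor
>> node="${CRM_notify_node}"            # node the operation ran on
>> rc="${CRM_notify_rc}"                # actual return code
>> target_rc="${CRM_notify_target_rc}"  # return code the cluster expected
>> desc="${CRM_notify_desc}"            # human-readable description
>>
>> # Only alert when the result differs from what the cluster expected.
>> if [ "${rc}" != "${target_rc}" ]; then
>>     # Stand-in for an SNMP trap or email; swap in snmptrap/mail as needed.
>>     logger -t pcmk_notify "${rsc} ${task} on ${node}: rc=${rc} (expected ${target_rc}) ${desc}"
>> fi
>> exit 0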
>>
>>> If this is correct what is the best practice for monitoring additional resource states?
>>
>> Define "additional"?
>> If the resource fails we'll normally recover it automatically.
> An example of an additional resource would be a VIP (using IPaddr2). I also have a multi-state pgsql resource, so if that resource fails it will either try to restart or promote another node in the cluster to Master.
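> For context, the VIPs are plain IPaddr2 primitives configured along these lines (the address and netmask here are placeholders, not the real ones):
>
> crm configure primitive PG_CLI_VIP ocf:heartbeat:IPaddr2 \
>   params ip="192.0.2.10" cidr_netmask="24" \
>   op monitor interval="30s"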
>
> v/r
>
> STEVE
>
>>
>>>
>>> v/r
>>>
>>> STEVE
>>>
>>>
>>> Below are some additional data points.
>>>
>>>
>>> Creating the Resource
>>>
>>> [root at pgdb2 tmp]# crm configure primitive SNMPMon ocf:pacemaker:ClusterMon \
>>>> params user="root" update="30" extra_options="-E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net" \
>>>> op monitor on-fail="restart" interval="60"
>>>
>>>
>>> Manual crm_mon output
>>>
>>> Last updated: Thu May 9 10:24:30 2013
>>> Last change: Thu May 9 10:20:49 2013 via cibadmin on pgdb2.example.com
>>> Stack: cman
>>> Current DC: pgdb1.example.com - partition with quorum
>>> Version: 1.1.8-7.el6-394e906
>>> 3 Nodes configured, unknown expected votes
>>> 6 Resources configured.
>>>
>>>
>>> Node pgdb1.example.com: standby
>>> Online: [ pgdb2.example.com pgdb3.example.com ]
>>>
>>> PG_REP_VIP (ocf::heartbeat:IPaddr2): Started pgdb2.example.com
>>> PG_CLI_VIP (ocf::heartbeat:IPaddr2): Started pgdb2.example.com
>>> Master/Slave Set: msPGSQL [PGSQL]
>>> Masters: [ pgdb2.example.com ]
>>> Slaves: [ pgdb3.example.com ]
>>> Stopped: [ PGSQL:2 ]
>>> SNMPMon (ocf::pacemaker:ClusterMon): Started pgdb3.example.com
>>>
>>> PS to check for process on pgdb3
>>>
>>> [root at pgdb3 tmp]# ps aux | grep crm_mon
>>> root 16097 0.0 0.0 82624 2784 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>> root 16099 0.0 0.0 82624 2660 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>> root 16104 0.0 0.0 82624 2448 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>> root 16515 0.0 0.0 103244 852 pts/0 S+ 10:21 0:00 grep crm_mon
>>>
>>> Output from corosync.log
>>>
>>> May 09 10:20:51 [3100] pgdb3.cha.arin.net lrmd: info: process_lrmd_get_rsc_info: Resource 'SNMPMon' not found (3 active resources)
>>> May 09 10:20:51 [3100] pgdb3.cha.arin.net lrmd: info: process_lrmd_rsc_register: Added 'SNMPMon' to the rsc list (4 active resources)
>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: info: services_os_action_execute: Managed ClusterMon_meta-data_0 process 16010 exited with rc=0
>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_monitor_0 (call=61, rc=7, cib-update=28, confirmed=true) not running
>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_start_0 (call=64, rc=0, cib-update=29, confirmed=true) ok
>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_monitor_60000 (call=67, rc=0, cib-update=30, confirmed=false) ok