[Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

Andrew Martin amartin at xes-inc.com
Tue Jun 19 14:07:48 EDT 2012


Hi Emmanuel, 


Here is the output from crm_mon -of1: 

Operations: 
* Node quorumnode: 
p_drbd_mount2:0: migration-threshold=1000000 
+ (4) probe: rc=5 (not installed) 
p_drbd_mount1:0: migration-threshold=1000000 
+ (5) probe: rc=5 (not installed) 
p_drbd_vmstore:0: migration-threshold=1000000 
+ (6) probe: rc=5 (not installed) 
p_vm_myvm: migration-threshold=1000000 
+ (12) probe: rc=5 (not installed) 
* Node vmhost1: 
p_drbd_mount2:0: migration-threshold=1000000 
+ (34) promote: rc=0 (ok) 
+ (62) monitor: interval=10000ms rc=8 (master) 
p_drbd_vmstore:0: migration-threshold=1000000 
+ (26) promote: rc=0 (ok) 
+ (64) monitor: interval=10000ms rc=8 (master) 
p_fs_vmstore: migration-threshold=1000000 
+ (36) start: rc=0 (ok) 
+ (38) monitor: interval=20000ms rc=0 (ok) 
p_ping:0: migration-threshold=1000000 
+ (12) start: rc=0 (ok) 
+ (22) monitor: interval=20000ms rc=0 (ok) 
p_vm_myvm: migration-threshold=1000000 
+ (65) start: rc=0 (ok) 
+ (66) monitor: interval=10000ms rc=0 (ok) 
stonithvmhost2: migration-threshold=1000000 
+ (17) start: rc=0 (ok) 
p_drbd_mount1:0: migration-threshold=1000000 
+ (31) promote: rc=0 (ok) 
+ (63) monitor: interval=10000ms rc=8 (master) 
p_sysadmin_notify:0: migration-threshold=1000000 
+ (13) start: rc=0 (ok) 
+ (18) monitor: interval=10000ms rc=0 (ok) 
* Node vmhost2: 
p_drbd_mount2:1: migration-threshold=1000000 
+ (14) start: rc=0 (ok) 
+ (36) monitor: interval=20000ms rc=0 (ok) 
p_drbd_vmstore:1: migration-threshold=1000000 
+ (16) start: rc=0 (ok) 
+ (38) monitor: interval=20000ms rc=0 (ok) 
p_ping:1: migration-threshold=1000000 
+ (12) start: rc=0 (ok) 
+ (20) monitor: interval=20000ms rc=0 (ok) 
stonithquorumnode: migration-threshold=1000000 
+ (18) start: rc=0 (ok) 
stonithvmhost1: migration-threshold=1000000 
+ (17) start: rc=0 (ok) 
p_sysadmin_notify:1: migration-threshold=1000000 
+ (13) start: rc=0 (ok) 
+ (19) monitor: interval=10000ms rc=0 (ok) 
p_drbd_mount1:1: migration-threshold=1000000 
+ (15) start: rc=0 (ok) 
+ (37) monitor: interval=20000ms rc=0 (ok) 


Failed actions: 
p_drbd_mount2:0_monitor_0 (node=quorumnode, call=4, rc=5, status=complete): not installed 
p_drbd_mount1:0_monitor_0 (node=quorumnode, call=5, rc=5, status=complete): not installed 
p_drbd_vmstore:0_monitor_0 (node=quorumnode, call=6, rc=5, status=complete): not installed 
p_vm_myvm_monitor_0 (node=quorumnode, call=12, rc=5, status=complete): not installed 


What is the number in parentheses before "start" or "monitor"? Is it the number of times the operation has occurred? Does it give any additional clues about what happened? What specifically should I look for in this output? 
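
In case it is useful for cross-referencing, the same history that crm_mon summarizes is stored in the status section of the CIB, so I can also pull it out directly. A rough one-liner (adjust the resource name as needed): 

========================================================== 
# List the recorded operations for p_vm_myvm straight from the live CIB. 
# If I understand it right, the call-id attribute on each lrm_rsc_op should 
# match the number crm_mon prints in parentheses, and rc-code / 
# last-rc-change give the result and when it was recorded. 
cibadmin -Q | grep 'lrm_rsc_op id="p_vm_myvm' 
========================================================== 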


Thanks, 


Andrew 
----- Original Message -----

From: "emmanuel segura" <emi2fast at gmail.com> 
To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org> 
Sent: Tuesday, June 19, 2012 12:12:34 PM 
Subject: Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource? 

Hello Andrew 

When your VirtualDomain resource fails, run crm_mon -of to see which resource operation reported the problem. 
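
For example, something like this (just a rough sketch; the log path and interval are arbitrary) keeps periodic snapshots so the operation history around the next failure is preserved: 

========================================================== 
# Append a timestamped one-shot snapshot (operations + fail counts) every 
# minute; stop it with Ctrl-C once the failure has been captured. 
while true; do 
    date >> /var/log/crm_mon-history.log 
    crm_mon -of1 >> /var/log/crm_mon-history.log 
    sleep 60 
done 
========================================================== 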


2012/6/19 Andrew Martin < amartin at xes-inc.com > 




Hi Emmanuel, 


Thanks for the idea. I looked through the rest of the log, and these "return code 8" results on the ocf:linbit:drbd resources also occur at other times (e.g. today) when the VirtualDomain resource is unaffected, which seems to indicate that these soft errors do not trigger a restart of the VirtualDomain resource. Is there anything else in the log that could indicate what caused this, or somewhere else I should look? 
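
The only other thing I have tried so far is looking at what the policy engine decided around the time of the restart and replaying the saved transition input, roughly like this (the pe-input number and directory are placeholders for whatever the log references at that time): 

========================================================== 
# Find the scheduler's decisions for the VM around the restart; the pengine 
# logs lines like "LogActions: Restart p_vm_myvm ..." for each transition. 
grep -E 'pengine.*LogActions.*p_vm_myvm' /var/log/daemon.log 

# The crmd also logs which pe-input file each transition was computed from; 
# replaying that file shows why the actions were chosen. 
crm_simulate -S -x /var/lib/pengine/pe-input-123.bz2 
========================================================== 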


Thanks, 


Andrew 



From: "emmanuel segura" < emi2fast at gmail.com > 
To: "The Pacemaker cluster resource manager" < pacemaker at oss.clusterlabs.org > 
Sent: Tuesday, June 19, 2012 9:57:19 AM 
Subject: Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource? 


I didn't see any error in your config; the only thing I noticed is this: 
========================================================== 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 monitor[55] (pid 12323) 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53] (pid 12324) 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8 
Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54] (pid 12396) 
========================================================= 
It could be a DRBD problem, but to tell you the truth, I'm not sure. 

====================================================== 
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html 
========================================================= 
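
For the codes in your output, cross-checking against that page: rc=5 is OCF_ERR_INSTALLED ("not installed", expected for the probes on the quorum node), and rc=8 is OCF_RUNNING_MASTER, the normal result of a Master-role monitor. A quick way to see whether the rc=8 results are routine rather than tied to the restart, for example: 

========================================================== 
# Count the Master-role monitor results per day; if they appear every day, 
# they are probably not what triggered the restart. 
grep 'exited with return code 8' /var/log/daemon.log | awk '{print $1, $2}' | uniq -c 
========================================================== 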

2012/6/19 Andrew Martin < amartin at xes-inc.com > 

> Hello, 
> 
> I have a 3 node Pacemaker+Heartbeat cluster (two real nodes and one 
> "standby" quorum node) with Ubuntu 10.04 LTS on the nodes and using the 
> Pacemaker+Heartbeat packages from the Ubuntu HA Team PPA ( 
> https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa ). 


> I have configured 3 DRBD resources, a filesystem mount, and a KVM-based 
> virtual machine (using the VirtualDomain resource). I have constraints in 
> place so that the DRBD devices must become primary and the filesystem must 
> be mounted before the VM can start: 
> node $id="1ab0690c-5aa0-4d9c-ae4e-b662e0ca54e5" vmhost1 
> node $id="219e9bf6-ea99-41f4-895f-4c2c5c78484a" quorumnode \ 
> attributes standby="on" 
> node $id="645e09b4-aee5-4cec-a241-8bd4e03a78c3" vmhost2 
> primitive p_drbd_mount2 ocf:linbit:drbd \ 
> params drbd_resource="mount2" \ 
> op start interval="0" timeout="240" \ 
> op stop interval="0" timeout="100" \ 
> op monitor interval="10" role="Master" timeout="30" \ 
> op monitor interval="20" role="Slave" timeout="30" 
> primitive p_drbd_mount1 ocf:linbit:drbd \ 
> params drbd_resource="mount1" \ 
> op start interval="0" timeout="240" \ 
> op stop interval="0" timeout="100" \ 
> op monitor interval="10" role="Master" timeout="30" \ 
> op monitor interval="20" role="Slave" timeout="30" 
> primitive p_drbd_vmstore ocf:linbit:drbd \ 
> params drbd_resource="vmstore" \ 
> op start interval="0" timeout="240" \ 
> op stop interval="0" timeout="100" \ 
> op monitor interval="10" role="Master" timeout="30" \ 
> op monitor interval="20" role="Slave" timeout="30" 
> primitive p_fs_vmstore ocf:heartbeat:Filesystem \ 
> params device="/dev/drbd0" directory="/mnt/storage/vmstore" fstype="ext4" \ 
> op start interval="0" timeout="60" \ 
> op stop interval="0" timeout="60" \ 
> op monitor interval="20" timeout="40" 
> primitive p_ping ocf:pacemaker:ping \ 
> params name="p_ping" host_list="192.168.1.25 192.168.1.26" multiplier="1000" \ 
> op start interval="0" timeout="60" \ 
> op monitor interval="20s" timeout="60" 
> primitive p_sysadmin_notify ocf:heartbeat:MailTo \ 
> params email=" alert at example.com " \ 
> params subject="Pacemaker Change" \ 
> op start interval="0" timeout="10" \ 
> op stop interval="0" timeout="10" \ 
> op monitor interval="10" timeout="10" 
> primitive p_vm_myvm ocf:heartbeat:VirtualDomain \ 
> params config="/mnt/storage/vmstore/config/myvm.xml" \ 
> meta allow-migrate="false" target-role="Started" is-managed="true" \ 
> op start interval="0" timeout="180" \ 
> op stop interval="0" timeout="180" \ 
> op monitor interval="10" timeout="30" 
> primitive stonithquorumnode stonith:external/webpowerswitch \ 
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" wps_password="xxx" hostname_to_stonith="quorumnode" 
> primitive stonithvmhost1 stonith:external/webpowerswitch \ 
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" wps_password="xxx" hostname_to_stonith="vmhost1" 
> primitive stonithvmhost2 stonith:external/webpowerswitch \ 
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" wps_password="xxx" hostname_to_stonith="vmhost2" 
> group g_vm p_fs_vmstore p_vm_myvm 
> ms ms_drbd_mount2 p_drbd_mount2 \ 
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" 
> ms ms_drbd_mount1 p_drbd_mount1 \ 
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" 
> ms ms_drbd_vmstore p_drbd_vmstore \ 
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" 
> clone cl_ping p_ping \ 
> meta interleave="true" 
> clone cl_sysadmin_notify p_sysadmin_notify 
> location loc_run_on_most_connected g_vm \ 
> rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping 
> location loc_st_nodescan stonithquorumnode -inf: vmhost1 
> location loc_st_vmhost1 stonithvmhost1 -inf: vmhost1 
> location loc_st_vmhost2 stonithvmhost2 -inf: vmhost2 
> colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master ms_drbd_tools:Master ms_drbd_crm:Master 
> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_tools:promote ms_drbd_crm:promote g_vm:start 
> property $id="cib-bootstrap-options" \ 
> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \ 
> cluster-infrastructure="Heartbeat" \ 
> stonith-enabled="true" \ 
> no-quorum-policy="freeze" \ 
> last-lrm-refresh="1337746179" 
> 
> This has been working well, however last week Pacemaker all of a sudden 
> stopped the p_vm_myvm resource and then started it up again. I have 
> attached the relevant section of /var/log/daemon.log - I am unable to 
> determine what caused Pacemaker to restart this resource. Based on the log, 
> could you tell me what event triggered this? 
> 
> Thanks, 
> 
> Andrew 
> 
> 
> 


-- 
this is my life and I live it for as long as God wills 









-- 
this is my life and I live it for as long as God wills 

_______________________________________________ 
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker 

Project Home: http://www.clusterlabs.org 
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
Bugs: http://bugs.clusterlabs.org 
