[Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

Tue Jun 19 18:33:46 EDT 2012

On Tue, Jun 19, 2012 at 09:38:50AM -0500, Andrew Martin wrote:
> Hello, 
> 
> 
> I have a 3-node Pacemaker+Heartbeat cluster (two real nodes and one "standby" quorum node) with Ubuntu 10.04 LTS on the nodes, using the Pacemaker+Heartbeat packages from the Ubuntu HA Team PPA ( https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa ). I have configured 3 DRBD resources, a filesystem mount, and a KVM-based virtual machine (using the VirtualDomain resource). I have constraints in place so that the DRBD devices must become primary and the filesystem must be mounted before the VM can start:

> location loc_run_on_most_connected g_vm \ 
> rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping 

This is the rule that matters here: it scores each node by the value
of its p_ping attribute.

> This has been working well, however last week Pacemaker all of a
> sudden stopped the p_vm_myvm resource and then started it up again. I
> have attached the relevant section of /var/log/daemon.log - I am
> unable to determine what caused Pacemaker to restart this resource.
> Based on the log, could you tell me what event triggered this? 
> 
> 
> Thanks, 
> 
> 
> Andrew 

> Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: rsc:p_sysadmin_notify:0 monitor[18] (pid 3661)
> Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: operation monitor[18] on p_sysadmin_notify:0 for client 3856: pid 3661 exited with return code 0
> Jun 14 15:26:42 vmhost1 cib: [3852]: info: cib_stats: Processed 219 operations (182.00us average, 0% utilization) in the last 10min
> Jun 14 15:32:43 vmhost1 lrmd: [3853]: info: operation monitor[22] on p_ping:0 for client 3856: pid 10059 exited with return code 0
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 monitor[55] (pid 12323)
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53] (pid 12324)
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8
> Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54] (pid 12396)
> Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: operation monitor[54] on p_drbd_mount1:0 for client 3856: pid 12396 exited with return code 8
> Jun 14 15:36:42 vmhost1 cib: [3852]: info: cib_stats: Processed 220 operations (272.00us average, 0% utilization) in the last 10min
> Jun 14 15:37:34 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm monitor[57] (pid 14061)
> Jun 14 15:37:34 vmhost1 lrmd: [3853]: info: operation monitor[57] on p_vm_myvm for client 3856: pid 14061 exited with return code 0

> Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000)
> Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 163: p_ping=1000

And here the score on the location constraint changes for this node:
p_ping drops to 1000.

You asked for "run on most connected", and your pingd resource
determined that "the other" one was "better" connected.
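
To make the scoring concrete, here is a minimal Python sketch of the
comparison. It is a simplified model, not Pacemaker's actual placement
logic, which sums many other scores as well. The function name
preferred_node, the node name "vmhost2", and the assumption that the
other node's p_ping stayed at 2000 are hypothetical; the log above only
shows the values for vmhost1.

```python
def preferred_node(ping_scores, current):
    """Pick the node with the highest p_ping score.

    Simplified model: the location score of each node is just its
    p_ping attribute. On a tie, the node currently running the
    resource keeps it.
    """
    return max(
        ping_scores,
        key=lambda node: (ping_scores[node], node == current),
    )


# "vmhost2" and its value of 2000 are assumptions for illustration.
before = {"vmhost1": 2000, "vmhost2": 2000}
after = {"vmhost1": 1000, "vmhost2": 2000}

print(preferred_node(before, current="vmhost1"))  # vmhost1 (tie, stays put)
print(preferred_node(after, current="vmhost1"))   # vmhost2 (better connected)
```

Under these assumptions, a single missed ping target on vmhost1 is
enough to make the other node the preferred location.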

> Jun 14 15:42:36 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=136:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_vm_myvm_stop_0 )
> Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm stop[58] (pid 18174)

...

> Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (2000)
> Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 165: p_ping=2000

And there it is, back at 2000 again ...
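
When hunting for this kind of event in daemon.log, it can help to pull
out only the attrd updates for the ping attribute. The following is a
rough Python sketch that assumes the log line format shown in the
excerpts above; the function name attribute_changes and the regular
expression are illustrative, not part of any Pacemaker tool.

```python
import re

# Matches lines like:
#   Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 163: p_ping=1000
# An optional leading "> " (mail quoting) is tolerated.
ATTRD_UPDATE = re.compile(
    r"^(?:>\s*)?(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+attrd:.*"
    r"attrd_perform_update: Sent update \d+: (?P<name>\w+)=(?P<value>\d+)"
)


def attribute_changes(lines, name="p_ping"):
    """Return (timestamp, host, value) for each attrd update of `name`."""
    changes = []
    for line in lines:
        match = ATTRD_UPDATE.search(line)
        if match and match.group("name") == name:
            changes.append(
                (match.group("stamp"), match.group("host"), int(match.group("value")))
            )
    return changes


sample = [
    "Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 163: p_ping=1000",
    "Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm stop[58] (pid 18174)",
    "Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 165: p_ping=2000",
]

for stamp, host, value in attribute_changes(sample):
    print(stamp, host, value)
```

Lining these timestamps up against the stop and start operations in
the same log shows whether a resource move followed a change in the
ping score.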

	Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.