[Pacemaker] reboot of non-vm host results in VM restart -- of chickens and eggs and VMs

Thu Dec 19 13:30:33 EST 2013

Hello,

Earlier emails related to this topic:
[pacemaker] chicken-egg-problem with libvirtd and a VM within cluster
[pacemaker] VirtualDomain problem after reboot of one node

My configuration:

RHEL6.5/CMAN/gfs2/Pacemaker/crmsh

pacemaker-libs-1.1.10-14.el6_5.1.x86_64
pacemaker-cli-1.1.10-14.el6_5.1.x86_64
pacemaker-1.1.10-14.el6_5.1.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64

Two-node HA VM cluster using a real shared drive, not DRBD.

Resources (relevant to this discussion):
primitive p_fs_images ocf:heartbeat:Filesystem \
primitive p_libvirtd lsb:libvirtd \
primitive virt ocf:heartbeat:VirtualDomain \

services chkconfig on: cman, clvmd, pacemaker
services chkconfig off: corosync, gfs2, libvirtd

Observation:

Rebooting the NON-host system (the node that is not running the VM)
results in the restart of the VM merrily running on the host system.

Apparent cause:

Upon startup, Pacemaker apparently checks the status of the configured
resources. However, the initial status request (virt_monitor_0) for the
virt (ocf:heartbeat:VirtualDomain) resource fails with:

Dec 18 12:19:30 [4147] mici-admin2       lrmd:  warning: child_timeout_callback:        virt_monitor_0 process (PID 4158) timed out
Dec 18 12:19:30 [4147] mici-admin2       lrmd:  warning: operation_finished:    virt_monitor_0:4158 - timed out after 200000ms
Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: Failed to reconnect to the hypervisor ]
Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: no valid connection ]
Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory ]

This failure then snowballs into an "orphan" situation in which the
running VM is restarted.
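
To confirm on the rebooted node that the socket named in the log is what
is missing, a minimal check could look like the sketch below. The
function name is hypothetical, and the default path is simply the one
from the log excerpt above; this is only a diagnostic aid, not something
Pacemaker itself runs.

```shell
#!/bin/sh
# Sketch: report whether the libvirt control socket exists on this node.
# libvirt_sock_present is a hypothetical helper name; the default path
# is taken from the lrmd log excerpt.
libvirt_sock_present() {
    sock="${1:-/var/run/libvirt/libvirt-sock}"
    if [ -S "$sock" ]; then
        # -S is true only if the path exists and is a socket
        echo "present: $sock"
    else
        echo "missing: $sock"
        return 1
    fi
}
```

Calling libvirt_sock_present with no argument right after pacemaker
starts should show whether libvirtd was up before the status request
ran.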

An earlier suggestion was to chkconfig libvirtd on (and presumably
delete the p_libvirtd resource) so that /var/run/libvirt/libvirt-sock
is created by the libvirtd service at boot. With libvirtd started by
the system, there is no unneeded restart of the VM.
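
As a rough sketch, the suggestion amounts to the commands below, run as
root on each node. The wrapper function and the DRY_RUN switch are
hypothetical and only there so the commands can be printed before being
run; whether the p_libvirtd resource should also be deleted is left
open here, as it is in the suggestion itself.

```shell
#!/bin/sh
# Sketch only: hand libvirtd to the init system instead of Pacemaker.
# DRY_RUN is a hypothetical switch; it defaults to 1, so the commands
# are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_libvirtd_at_boot() {
    run chkconfig libvirtd on     # have init start libvirtd at boot
    run service libvirtd start    # start it now, which creates libvirt-sock
}
```

Setting DRY_RUN=0 before calling enable_libvirtd_at_boot would execute
the two commands for real.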

However, it may be that removing libvirtd from Pacemaker control leaves
the VM vdisk filesystem susceptible to corruption during a
reboot-induced failover.

Question:

Is there an accepted Pacemaker configuration such that the unneeded
restart of the VM does not occur when the non-host system is rebooted?

Regards,
Bob Haxo
