[Pacemaker] reboot of non-vm host results in VM restart -- of chickens and eggs and VMs

Thu Dec 19 14:44:14 EST 2013

Hi Emmanuel,

Thanks for the suggestions. The problem itself is pretty clear; what is
not clear is the fix or the workaround.

Search the Pacemaker email archive for the email of Andrew Beekhof, 12
Oct 2012, "Re: [Pacemaker] chicken-egg-problem with libvirtd and a VM
within cluster", and the email to which he is responding (from Tom
Fernandes).

The status/monitor function of VirtualDomain fails because
/var/run/libvirt/libvirt-sock has not been created.  This socket is
created by lsb:libvirtd, but libvirtd is not started (as a resource)
until Pacemaker has heard back from heartbeat:VirtualDomain, which will
never happen until /var/run/libvirt/libvirt-sock exists.  (Running
"service libvirtd start" by hand during this wait period does allow
Pacemaker to continue starting resources.)  Once the VirtualDomain
monitor function times out, Pacemaker breaks out of the loop, and the
result is a restart of the VM.
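
To make the loop concrete, a minimal shell sketch follows. It is not the
VirtualDomain agent's actual code: the function name probe_virt is
hypothetical, and it assumes that a missing libvirt socket means no domain
can be running on the node (which would be wrong if qemu processes outlive
libvirtd). It only illustrates that a status check on the freshly booted
node could report "not running" immediately instead of waiting on libvirtd
until the monitor times out.

```shell
#!/bin/sh
# OCF return codes as defined by the OCF resource agent API.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical helper (not part of any shipped agent): decide what a
# status check should report, given the path of the libvirt socket.
probe_virt() {
    sock="${1:-/var/run/libvirt/libvirt-sock}"
    if [ ! -S "$sock" ]; then
        # Assumption: no socket means libvirtd is not up on this node,
        # so the domain is treated as not running here.
        return $OCF_NOT_RUNNING
    fi
    # libvirtd is reachable; a real agent would query the domain state here.
    return $OCF_SUCCESS
}
```

On a node where libvirtd has not yet been started, "probe_virt; echo $?"
would report OCF_NOT_RUNNING rather than hanging on the missing socket.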

I am hoping that "Unfortunately we still don't have a good answer for you."
is no longer the case, and that there is a fix or a community-accepted
workaround for the issue.

Regards,
Bob Haxo

On Thu, 2013-12-19 at 19:48 +0100, emmanuel segura wrote:
> Maybe the problem is this: the cluster tries to start the VM and
> libvirtd isn't started
> 
> 
> 
> 
> 2013/12/19 emmanuel segura <emi2fast at gmail.com>
> 
>         If you don't set your VM to start at boot time, you don't need
>         to put libvirtd in the cluster. Maybe that isn't the problem,
>         but why put OS services in the cluster, for example crond ...... :)
>         
>         
>         
>         
>         2013/12/19 Bob Haxo <bhaxo at sgi.com>
>         
>                 Hello,
>                 
>                 Earlier emails related to this topic:
>                 [pacemaker] chicken-egg-problem with libvirtd and a VM
>                 within cluster
>                 [pacemaker] VirtualDomain problem after reboot of one
>                 node
>                 
>                 
>                 My configuration:
>                 
>                 RHEL6.5/CMAN/gfs2/Pacemaker/crmsh
>                 
>                 pacemaker-libs-1.1.10-14.el6_5.1.x86_64
>                 pacemaker-cli-1.1.10-14.el6_5.1.x86_64
>                 pacemaker-1.1.10-14.el6_5.1.x86_64
>                 pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64
>                 
>                 Two-node HA VM cluster using a real shared drive, not
>                 DRBD.
>                 
>                 Resources (relevant to this discussion):
>                 primitive p_fs_images ocf:heartbeat:Filesystem \
>                 primitive p_libvirtd lsb:libvirtd \
>                 primitive virt ocf:heartbeat:VirtualDomain \
>                 
>                 services chkconfig on: cman, clvmd, pacemaker
>                 services chkconfig off: corosync, gfs2, libvirtd
>                 
>                 Observation:
>                 
>                 Rebooting the NON-host system results in the restart
>                 of the VM merrily running on the host system.
>                 
>                 Apparent cause:
>                 
>                 Upon startup, Pacemaker apparently checks the status
>                 of configured resources. However, the status request
>                 for the virt (ocf:heartbeat:VirtualDomain) resource
>                 fails with:
>                 
>                 
>                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:  warning: child_timeout_callback:        virt_monitor_0 process (PID 4158) timed out
>                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:  warning: operation_finished:    virt_monitor_0:4158 - timed out after 200000ms
>                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: Failed to reconnect to the hypervisor ]
>                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: no valid connection ]
>                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory ]
>                 
>                 
>                 This failure then snowballs into an "orphan" situation
>                 in which the running VM is restarted.
>                 
>                 There was the suggestion to chkconfig libvirtd on (and
>                 presumably delete the resource) so that
>                 /var/run/libvirt/libvirt-sock is created by
>                 service libvirtd at boot. With libvirtd started by the
>                 system, there is no un-needed restart of the VM.
>                 
>                 However, it may be that removing libvirtd from
>                 Pacemaker control leaves the VM vdisk filesystem
>                 susceptible to corruption during a reboot-induced
>                 failover.
>                 
>                 Question:
>                 
>                 Is there an accepted Pacemaker configuration such that
>                 the un-needed restart of the VM does not occur with
>                 the reboot of the non-host system?
>                 
>                 Regards,
>                 Bob Haxo
>                 
>                 _______________________________________________
>                 Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>                 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>                 
>                 Project Home: http://www.clusterlabs.org
>                 Getting started:
>                 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>                 Bugs: http://bugs.clusterlabs.org
>                 
>         
>         
>         
>         
>         -- 
>         esta es mi vida e me la vivo hasta que dios quiera
> 