[Pacemaker] reboot of non-vm host results in VM restart -- of chickens and eggs and VMs

Thu Dec 19 16:22:08 EST 2013

Hi Emmanuel,

> I don't see any reason to put libvirtd as a primitive in Pacemaker

Yes ... well, maybe.  During my testing of failure scenarios (in
particular, reboot of the VM host), several times the VM filesystem
ended up corrupted and I needed to reinstall the VM.  At least a couple
of these failures occurred when I was testing with the system starting
libvirtd and not controlling libvirtd start/stop via a cloned resource.

And, those failures are the reason that I'm seeking the wisdom of
others.

Now that I understand the issues better, I will again test system
start of libvirtd, with more care.
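
One way to check the precondition during such testing is to wait for
the libvirt socket to appear before Pacemaker is started.  The following
is only a sketch, assuming plain POSIX sh; the function name
wait_for_socket and its timeout argument are hypothetical and are not
part of Pacemaker, libvirt, or the VirtualDomain agent.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker or libvirt): poll until a
# UNIX socket exists at the given path.
# Usage: wait_for_socket PATH [TIMEOUT_SECONDS]
# Returns 0 once PATH exists and is a socket, 1 if the timeout expires.
wait_for_socket() {
    sock_path=$1
    timeout=${2:-30}    # assumed default of 30 seconds
    waited=0
    while [ ! -S "$sock_path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, "wait_for_socket /var/run/libvirt/libvirt-sock 60 &&
service pacemaker start" would hold back Pacemaker until libvirtd has
created its socket.  Whether gating Pacemaker this way avoids the
spurious VM restart has not been tested here.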

Thanks,
Bob Haxo

On Thu, 2013-12-19 at 21:30 +0100, emmanuel segura wrote:
> Remove libvirtd from Pacemaker and run "chkconfig libvirtd on" on
> every node, so that the cluster only manages the VM. Maybe I'm wrong,
> but I don't see any reason to put libvirtd as a primitive in Pacemaker.
> 
> 
> 
> 
> 2013/12/19 Bob Haxo <bhaxo at sgi.com>
> 
>         Hi Emmanuel,
>         
>         Thanks for the suggestions. It is pretty clear what the
>         problem is; it's just not clear what the fix or the
>         workaround is.
>         
>         Search the Pacemaker email archive for the email of Andrew
>         Beekhof, 12 Oct 2012, "Re: [Pacemaker] chicken-egg-problem
>         with libvirtd and a VM within cluster", and the email to which
>         he is responding (from Tom Fernandes).
>         
>         The status/monitor function of VirtualDomain fails because
>         /var/run/libvirt/libvirt-sock has not been created.  This
>         socket is created by lsb:libvirtd, but that is not started
>         (as a resource) until Pacemaker has heard back from
>         ocf:heartbeat:VirtualDomain, which will never happen
>         until /var/run/libvirt/libvirt-sock has been created (running
>         "service libvirtd start" during this wait period does let
>         Pacemaker continue starting resources).  After the
>         VirtualDomain monitor function times out, Pacemaker recovers
>         from the failed probe, which results in a restart of the VM.
>         
>         I am hoping that "Unfortunately we still don't have a good
>         answer for you." is no longer the case, and that there is a
>         fix or a community-accepted workaround for the issue.
>         
>         
>         Regards,
>         Bob Haxo
>         
>         
>         
>         
>         
>         
>         
>         On Thu, 2013-12-19 at 19:48 +0100, emmanuel segura wrote: 
>         
>         > Maybe the problem is this: the cluster tries to start the VM
>         > while libvirtd isn't started yet.
>         > 
>         > 
>         > 
>         > 2013/12/19 emmanuel segura <emi2fast at gmail.com>
>         > 
>         >         If you don't set your VM to start at boot time, you
>         >         don't need to put libvirtd in the cluster. Maybe this
>         >         isn't the problem, but why put OS services in the
>         >         cluster, for example crond ...... :)
>         >         
>         >         
>         >         
>         >         2013/12/19 Bob Haxo <bhaxo at sgi.com> 
>         >         
>         >                 Hello,
>         >                 
>         >                 Earlier emails related to this topic:
>         >                 [pacemaker] chicken-egg-problem with
>         >                 libvirtd and a VM within cluster
>         >                 [pacemaker] VirtualDomain problem after
>         >                 reboot of one node
>         >                 
>         >                 
>         >                 My configuration:
>         >                 
>         >                 RHEL6.5/CMAN/gfs2/Pacemaker/crmsh
>         >                 
>         >                 pacemaker-libs-1.1.10-14.el6_5.1.x86_64
>         >                 pacemaker-cli-1.1.10-14.el6_5.1.x86_64
>         >                 pacemaker-1.1.10-14.el6_5.1.x86_64
>         >                 pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64
>         >                 
>         >                 Two-node HA VM cluster using a real shared
>         >                 drive, not DRBD.
>         >                 
>         >                 Resources (relevant to this discussion):
>         >                 primitive p_fs_images ocf:heartbeat:Filesystem \
>         >                 primitive p_libvirtd lsb:libvirtd \
>         >                 primitive virt ocf:heartbeat:VirtualDomain \
>         >                 
>         >                 services chkconfig on: cman, clvmd, pacemaker
>         >                 services chkconfig off: corosync, gfs2, libvirtd
>         >                 
>         >                 Observation:
>         >                 
>         >                 Rebooting the NON-host system results in the
>         >                 restart of the VM merrily running on the
>         >                 host system.
>         >                 
>         >                 Apparent cause:
>         >                 
>         >                 Upon startup, Pacemaker apparently checks
>         >                 the status of configured resources. However,
>         >                 the status request for the virt
>         >                 (ocf:heartbeat:VirtualDomain) resource fails
>         >                 with:
>         >                 
>         >                 
>         >                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:  warning: child_timeout_callback:        virt_monitor_0 process (PID 4158) timed out
>         >                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:  warning: operation_finished:    virt_monitor_0:4158 - timed out after 200000ms
>         >                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: Failed to reconnect to the hypervisor ]
>         >                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: no valid connection ]
>         >                 Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory ]
>         >                 
>         >                 
>         >                 This failure then snowballs into an "orphan"
>         >                 situation in which the running VM is
>         >                 restarted.
>         >                 
>         >                 There was the suggestion to run "chkconfig
>         >                 libvirtd on" (and presumably delete the
>         >                 resource) so that
>         >                 /var/run/libvirt/libvirt-sock is created at
>         >                 boot by the libvirtd service. With libvirtd
>         >                 started by the system, there is no un-needed
>         >                 reboot of the VM.
>         >                 
>         >                 However, it may be that removing libvirtd
>         >                 from Pacemaker control leaves the VM vdisk
>         >                 filesystem susceptible to corruption during
>         >                 a reboot-induced failover.
>         >                 
>         >                 Question:
>         >                 
>         >                 Is there an accepted Pacemaker configuration
>         >                 such that the un-needed restart of the VM
>         >                 does not occur with the reboot of the
>         >                 non-host system?
>         >                 
>         >                 Regards,
>         >                 Bob Haxo
>         >                 
>         >                 
>         >                 
>         >                 
>         >                 
>         >                 
>         >                 
>         >                 
>         >                 
>         >                 _______________________________________________
>         >                 Pacemaker mailing list:
>         >                 Pacemaker at oss.clusterlabs.org
>         >                 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>         >                 
>         >                 Project Home: http://www.clusterlabs.org
>         >                 Getting started:
>         >                 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>         >                 Bugs: http://bugs.clusterlabs.org
>         >                 
>         >         
>         >         
>         >         
>         >         
>         >         -- 
>         >         this is my life and I live it for as long as God wills
>         > 
>         > 
>         > 
>         > 
>         
>         
>         
> 
> 
> 
> 