I don&#39;t think so, as I have other similar clusters on the same network and didn&#39;t have any issues with them.<br>The only thing I could detect was that the virtual machine was unresponsive.<br>But I think the VM crash was not like a power shutdown; it was more like the VM became very slow and then crashed completely.<br>

<br>Even if the drbd-nagios resource times out, shouldn&#39;t it fail over to the other node?<br><br>Regards,<br><br><br><div class="gmail_quote">On 20 February 2012 12:35, Andrew Beekhof <span dir="ltr">&lt;<a href="mailto:andrew@beekhof.net">andrew@beekhof.net</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Mon, Feb 13, 2012 at 9:57 PM, Hugo Deprez &lt;<a href="mailto:hugo.deprez@gmail.com">hugo.deprez@gmail.com</a>&gt; wrote:<br>


&gt; Hello,<br>

&gt;<br>

&gt; does anyone have an idea ?<br>

<br>

</div>Well I see:<br>

<br>

Feb  8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: LRM operation drbd-nagios:1_monitor_15000 (90) Timed Out (timeout=20000ms)<br>

Feb  8 13:00:05 server01 crmd: [19470]: WARN: cib_rsc_callback: Resource update 415 failed: (rc=-41) Remote node did not respond<br>

Feb  8 13:06:36 server01 crmd: [19470]: notice: ais_dispatch: Membership 128: quorum lost<br>

<br>

which looks suspicious.  Network problem?<br>

<div class="HOEnZb"><div class="h5"><br>

&gt;<br>

&gt; it seems that at 13:06:38 the resources get started on the slave member.<br>

&gt; But then there is something wrong on server01:<br>

&gt;<br>

&gt; Feb  8 13:06:39 server01 pengine: [19469]: info: determine_online_status: Node server01 is online<br>

&gt; Feb  8 13:06:39 server01 pengine: [19469]: notice: unpack_rsc_op: Operation apache2_monitor_0 found resource apache2 active on server01<br>

&gt; Feb  8 13:06:39 server01 pengine: [19469]: notice: group_print:  Resource Group: supervision-grp<br>

&gt; Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print: fs-data    (ocf::heartbeat:Filesystem):    Stopped<br>

&gt; Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print: nagios-ip    (ocf::heartbeat:IPaddr2):    Stopped<br>

&gt; Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print: apache2    (ocf::heartbeat:apache):    Started server01<br>

&gt; Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print: nagios    (lsb:nagios3):    Stopped<br>

&gt;<br>

&gt;<br>

&gt; But I don&#39;t understand what is failing: is it DRBD or apache2 that causes the issue?<br>

&gt;<br>

&gt; Any idea ?<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; On 10 February 2012 09:39, Hugo Deprez &lt;<a href="mailto:hugo.deprez@gmail.com">hugo.deprez@gmail.com</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; Hello,<br>

&gt;&gt;<br>

&gt;&gt; please find the corosync logs attached to this mail.<br>

&gt;&gt; If you have any tips :)<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; Regards,<br>

&gt;&gt;<br>

&gt;&gt; Hugo<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; On 8 February 2012 15:39, Florian Haas &lt;<a href="mailto:florian@hastexo.com">florian@hastexo.com</a>&gt; wrote:<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; On Wed, Feb 8, 2012 at 2:29 PM, Hugo Deprez &lt;<a href="mailto:hugo.deprez@gmail.com">hugo.deprez@gmail.com</a>&gt;<br>

&gt;&gt;&gt; wrote:<br>

&gt;&gt;&gt; &gt; Dear community,<br>

&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt; &gt; I am currently running several corosync / drbd clusters in VMs running on a VMware ESXi host.<br>

&gt;&gt;&gt; &gt; The guest OS is Debian Squeeze.<br>

&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt; &gt; The active member of the cluster just froze and the VM was unreachable.<br>

&gt;&gt;&gt; &gt; But the resources did not manage to move to the other node.<br>

&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt; &gt; My cluster has the following resources:<br>

&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt; &gt; Resource Group: grp<br>

&gt;&gt;&gt; &gt;      fs-data    (ocf::heartbeat:Filesystem):<br>

&gt;&gt;&gt; &gt;      nagios-ip  (ocf::heartbeat:IPaddr2):<br>

&gt;&gt;&gt; &gt;      apache2    (ocf::heartbeat:apache):<br>

&gt;&gt;&gt; &gt;      nagios     (lsb:nagios3):<br>

&gt;&gt;&gt; &gt;      pnp        (lsb:npcd):<br>

&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt; &gt; I am currently troubleshooting this issue. I don&#39;t really know where to<br>

&gt;&gt;&gt; &gt; look. Of course I had a look at the logs, but it is pretty hard for me<br>

&gt;&gt;&gt; &gt; to<br>

&gt;&gt;&gt; &gt; understand what happened.<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; It&#39;s pretty hard for anyone else to understand _without_ logs. :)<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; &gt; I noticed that the VM crashed at 12:09 and that the cluster only tried to move the resources at 12:58, which does not make sense to me. Or maybe the host wasn&#39;t totally down?<br>

&gt;&gt;&gt; &gt;<br>

&gt;&gt;&gt; &gt; Do you have any idea how I can troubleshoot ?<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; Log analysis is where I would start.<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; &gt; Last thing: I noticed that if I start apache2 on the slave server, corosync doesn&#39;t detect that the resource is started. Could that be an issue?<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; Sure it could, but Pacemaker should happily recover from that.<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; Cheers,<br>

&gt;&gt;&gt; Florian<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; --<br>

&gt;&gt;&gt; Need help with High Availability?<br>

&gt;&gt;&gt; <a href="http://www.hastexo.com/now" target="_blank">http://www.hastexo.com/now</a><br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; _______________________________________________<br>

&gt;&gt;&gt; Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>

&gt;&gt;&gt; <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>

&gt;&gt;&gt; Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

&gt;&gt;&gt; Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;<br>

&gt;<br>


&gt;<br>

<br>


</div></div></blockquote></div><br>
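For the log analysis suggested in the thread, a small filter can help pull the interesting entries out of a large corosync/syslog file. The following is only a rough sketch: the line layout is inferred from the entries quoted above, and the function name `suspicious_entries`, the chosen levels and the `quorum lost` keyword are illustrative assumptions, not part of Pacemaker or corosync.

```python
import re

# Assumed syslog layout, inferred from the entries quoted in this thread:
# "<Mon> <day> <HH:MM:SS> <host> <daemon>: [<pid>]: <LEVEL>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w-]+): \[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)


def suspicious_entries(lines, levels=("ERROR", "WARN"), keywords=("quorum lost",)):
    """Return (timestamp, daemon, level, message) tuples worth a closer look.

    Hypothetical helper: keeps entries whose level is in `levels` or whose
    message contains one of `keywords`; lines that do not match the assumed
    layout are skipped.
    """
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        if m["level"] in levels or any(k in m["msg"] for k in keywords):
            hits.append((m["ts"], m["daemon"], m["level"], m["msg"]))
    return hits


# Entries taken from the thread, with the wrapped lines joined back together.
sample = [
    "Feb  8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: LRM "
    "operation drbd-nagios:1_monitor_15000 (90) Timed Out (timeout=20000ms)",
    "Feb  8 13:00:05 server01 crmd: [19470]: WARN: cib_rsc_callback: "
    "Resource update 415 failed: (rc=-41) Remote node did not respond",
    "Feb  8 13:06:36 server01 crmd: [19470]: notice: ais_dispatch: "
    "Membership 128: quorum lost",
    "Feb  8 13:06:39 server01 pengine: [19469]: info: determine_online_status: "
    "Node server01 is online",
]

for ts, daemon, level, msg in suspicious_entries(sample):
    print(ts, daemon, level, msg)
```

To run it against a real file, pass an open file object instead of `sample`; the path depends on the syslog configuration of the node. Lines that were wrapped by a mail client need to be joined first, otherwise the pattern will not match them.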