<div dir="ltr">Hi Vladislav/Andrew,<div><br></div><div>We found a similar problem - <a href="http://oss.clusterlabs.org/pipermail/pacemaker/2014-March/021245.html" target="_blank">http://oss.clusterlabs.org/pipermail/pacemaker/2014-March/021245.html</a> and found this on launchpad - <a href="https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496">https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496</a> . We upgraded libqb from 0.16 to 0.17.1 and seems to work.<br><br>Thank you for all your help,</div><div>Kiam</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Sep 12, 2014 at 12:06 PM, Vladislav Bogdanov <span dir="ltr">&lt;<a href="mailto:bubble@hoster-ok.com" target="_blank">bubble@hoster-ok.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">12.09.2014 05:00, Norbert Kiam Maclang wrote:<br>
&gt; Hi,<br>
&gt;<br>
&gt; After adding resource-level fencing on drbd, I still ended up having<br>
&gt; problems with timeouts on drbd. Are there any recommended settings for<br>
&gt; this? I followed what is written in the drbd documentation -<br>
&gt; <a href="http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html" target="_blank">http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html</a>.<br>
&gt; Another thing I can&#39;t understand is why failover works during the<br>
&gt; initial tests, even if I reboot the VMs several times, but after I let<br>
&gt; the cluster soak for a couple of hours (say 8 hours or more) and repeat<br>
&gt; the tests, it no longer fails over and I end up with a split brain. I<br>
&gt; confirmed that everything was healthy before performing the reboot: disk<br>
&gt; health and network are good, drbd is in sync, and the time on the<br>
&gt; servers is in sync.<br>
<br>
I recall seeing something similar about a year ago (around the time your<br>
pacemaker version is dated). I do not remember the exact cause, but I saw<br>
the drbd RA time out because it was waiting for something (fencing) to<br>
finish in kernel space. drbd calls userspace scripts from within kernel<br>
space, and you&#39;ll see them in the process list with the drbd kernel<br>
thread as their parent.<br>
<br>
I&#39;d also convert your corosync configuration from the &quot;member&quot; to the<br>
&quot;nodelist&quot; syntax, specifying the &quot;name&quot; parameter together with<br>
ring0_addr for each node (that parameter is not mentioned in the corosync<br>
docs but should be somewhere in Pacemaker Explained - it is used only by<br>
pacemaker).<br>
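<br>
Something like this, for example (a rough sketch using the addresses from<br>
your corosync.conf; the nodeids match the ones corosync auto-generated from<br>
the IPs, so the existing CIB node entries should keep working):<br>
<br>
nodelist {<br>
        node {<br>
                ring0_addr: 10.2.136.56<br>
                name: node01<br>
                nodeid: 167938104<br>
        }<br>
        node {<br>
                ring0_addr: 10.2.136.57<br>
                name: node02<br>
                nodeid: 167938105<br>
        }<br>
}<br>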
<br>
There is also trace_ra support in both pacemaker and crmsh (I cannot say<br>
whether the versions you have support it, but probably yes), so you may<br>
want to play with that to get the exact picture of what the resource<br>
agent is doing.<br>
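<br>
For example, with a recent enough crmsh (a rough sketch - I am not sure the<br>
command exists in crmsh 1.2.5, and the trace location may differ in your<br>
versions):<br>
<br>
# enable tracing of the monitor operation, reproduce the timeout, then disable<br>
crm resource trace drbd_pg monitor<br>
crm resource untrace drbd_pg monitor<br>
<br>
The shell trace of each agent invocation should then show up under<br>
/var/lib/heartbeat/trace_ra/.<br>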
<br>
In any case, upgrading to 1.1.12 and a more recent crmsh would be worthwhile,<br>
because you may simply be hitting a long-since-fixed and forgotten bug.<br>
<br>
Concerning your<br>
&gt;       expected-quorum-votes=&quot;1&quot;<br>
<br>
You need to configure votequorum in corosync with two_node: 1 instead of<br>
that line.<br>
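<br>
For example (a sketch of the corosync.conf quorum section; note that two_node<br>
implicitly enables wait_for_all, so both nodes must be seen at least once<br>
after a full cluster restart):<br>
<br>
quorum {<br>
        provider: corosync_votequorum<br>
        two_node: 1<br>
}<br>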
<br>
&gt;<br>
&gt; # Logs:<br>
&gt; node01 lrmd[1036]:  warning: child_timeout_callback:<br>
&gt; drbd_pg_monitor_29000 process (PID 27744) timed out<br>
&gt; node01 lrmd[1036]:  warning: operation_finished:<br>
&gt; drbd_pg_monitor_29000:27744 - timed out after 20000ms<br>
&gt; node01 crmd[1039]:    error: process_lrm_event: LRM operation<br>
&gt; drbd_pg_monitor_29000 (69) Timed Out (timeout=20000ms)<br>
&gt; node01 crmd[1039]:  warning: update_failcount: Updating failcount for<br>
&gt; drbd_pg on tyo1mqdb01p after failed monitor: rc=1 (update=value++,<br>
&gt; time=1410486352)<br>
&gt;<br>
&gt; Thanks,<br>
&gt; Kiam<br>
&gt;<br>
&gt; On Thu, Sep 11, 2014 at 6:58 PM, Norbert Kiam Maclang<br>
&gt; &lt;<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a> &lt;mailto:<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a>&gt;&gt;<br>
&gt; wrote:<br>
&gt;<br>
&gt;     Thank you Vladislav.<br>
&gt;<br>
&gt;     I have configured resource-level fencing on drbd and removed<br>
&gt;     wfc-timeout and degr-wfc-timeout (are these required?). My drbd<br>
&gt;     configuration is now:<br>
&gt;<br>
&gt;     resource pg {<br>
&gt;       device /dev/drbd0;<br>
&gt;       disk /dev/vdb;<br>
&gt;       meta-disk internal;<br>
&gt;       disk {<br>
&gt;         fencing resource-only;<br>
&gt;         on-io-error detach;<br>
&gt;         resync-rate 40M;<br>
&gt;       }<br>
&gt;       handlers {<br>
&gt;         fence-peer &quot;/usr/lib/drbd/crm-fence-peer.sh&quot;;<br>
&gt;         after-resync-target &quot;/usr/lib/drbd/crm-unfence-peer.sh&quot;;<br>
&gt;         split-brain &quot;/usr/lib/drbd/notify-split-brain.sh nkbm&quot;;<br>
&gt;       }<br>
&gt;       on node01 {<br>
&gt;         address <a href="http://10.2.136.52:7789" target="_blank">10.2.136.52:7789</a> &lt;<a href="http://10.2.136.52:7789" target="_blank">http://10.2.136.52:7789</a>&gt;;<br>
&gt;       }<br>
&gt;       on node02 {<br>
&gt;         address <a href="http://10.2.136.55:7789" target="_blank">10.2.136.55:7789</a> &lt;<a href="http://10.2.136.55:7789" target="_blank">http://10.2.136.55:7789</a>&gt;;<br>
&gt;       }<br>
&gt;       net {<br>
&gt;         verify-alg md5;<br>
&gt;         after-sb-0pri discard-zero-changes;<br>
&gt;         after-sb-1pri discard-secondary;<br>
&gt;         after-sb-2pri disconnect;<br>
&gt;       }<br>
&gt;     }<br>
&gt;<br>
&gt;     Failover works on my initial test (restarting both nodes alternately<br>
&gt;     - this always works). I will wait a couple of hours before running the<br>
&gt;     failover test again (which always failed on my previous setup).<br>
&gt;<br>
&gt;     Thank you!<br>
&gt;     Kiam<br>
&gt;<br>
&gt;     On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov<br>
&gt;     &lt;<a href="mailto:bubble@hoster-ok.com">bubble@hoster-ok.com</a> &lt;mailto:<a href="mailto:bubble@hoster-ok.com">bubble@hoster-ok.com</a>&gt;&gt; wrote:<br>
&gt;<br>
&gt;         11.09.2014 05:57, Norbert Kiam Maclang wrote:<br>
&gt;         &gt; Is this something to do with quorum? But I already set<br>
&gt;<br>
&gt;         You&#39;d need to configure fencing at the drbd resource level.<br>
&gt;<br>
&gt;         <a href="http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib" target="_blank">http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib</a><br>
&gt;<br>
&gt;<br>
&gt;         &gt;<br>
&gt;         &gt; property no-quorum-policy=&quot;ignore&quot; \<br>
&gt;         &gt; expected-quorum-votes=&quot;1&quot;<br>
&gt;         &gt;<br>
&gt;         &gt; Thanks in advance,<br>
&gt;         &gt; Kiam<br>
&gt;         &gt;<br>
&gt;         &gt; On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang<br>
&gt;         &gt; &lt;<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a><br>
&gt;         &lt;mailto:<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a>&gt;<br>
&gt;         &lt;mailto:<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a><br>
&gt;         &lt;mailto:<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a>&gt;&gt;&gt;<br>
&gt;         &gt; wrote:<br>
&gt;         &gt;<br>
&gt;         &gt;     Hi,<br>
&gt;         &gt;<br>
&gt;         &gt;     Please help me understand what is causing the problem. I have<br>
&gt;         &gt;     a 2-node cluster running on VMs using KVM. Each VM (I am using<br>
&gt;         &gt;     Ubuntu 14.04) runs on a separate hypervisor on a separate<br>
&gt;         &gt;     machine. Everything works well during testing (I restarted the<br>
&gt;         &gt;     VMs alternately), but after a day, when I kill the other node,<br>
&gt;         &gt;     corosync and pacemaker always end up hanging on the surviving<br>
&gt;         &gt;     node. Date and time on the VMs are in sync, I use unicast,<br>
&gt;         &gt;     tcpdump shows traffic between both nodes, and I confirmed that<br>
&gt;         &gt;     DRBD is healthy and crm_mon shows good status before I kill the<br>
&gt;         &gt;     other node. Below are my configurations and the versions I used:<br>
&gt;         &gt;<br>
&gt;         &gt;     corosync             2.3.3-1ubuntu1<br>
&gt;         &gt;     crmsh                1.2.5+hg1034-1ubuntu3<br>
&gt;         &gt;     drbd8-utils          2:8.4.4-1ubuntu1<br>
&gt;         &gt;     libcorosync-common4  2.3.3-1ubuntu1<br>
&gt;         &gt;     libcrmcluster4       1.1.10+git20130802-1ubuntu2<br>
&gt;         &gt;     libcrmcommon3        1.1.10+git20130802-1ubuntu2<br>
&gt;         &gt;     libcrmservice1       1.1.10+git20130802-1ubuntu2<br>
&gt;         &gt;     pacemaker            1.1.10+git20130802-1ubuntu2<br>
&gt;         &gt;     pacemaker-cli-utils  1.1.10+git20130802-1ubuntu2<br>
&gt;         &gt;     postgresql-9.3       9.3.5-0ubuntu0.14.04.1<br>
&gt;         &gt;<br>
&gt;         &gt;     # /etc/corosync/corosync.conf:<br>
&gt;         &gt;     totem {<br>
&gt;         &gt;     version: 2<br>
&gt;         &gt;     token: 3000<br>
&gt;         &gt;     token_retransmits_before_loss_const: 10<br>
&gt;         &gt;     join: 60<br>
&gt;         &gt;     consensus: 3600<br>
&gt;         &gt;     vsftype: none<br>
&gt;         &gt;     max_messages: 20<br>
&gt;         &gt;     clear_node_high_bit: yes<br>
&gt;         &gt;      secauth: off<br>
&gt;         &gt;      threads: 0<br>
&gt;         &gt;      rrp_mode: none<br>
&gt;         &gt;      interface {<br>
&gt;         &gt;                     member {<br>
&gt;         &gt;                             memberaddr: 10.2.136.56<br>
&gt;         &gt;                     }<br>
&gt;         &gt;                     member {<br>
&gt;         &gt;                             memberaddr: 10.2.136.57<br>
&gt;         &gt;                     }<br>
&gt;         &gt;                     ringnumber: 0<br>
&gt;         &gt;                     bindnetaddr: 10.2.136.0<br>
&gt;         &gt;                     mcastport: 5405<br>
&gt;         &gt;             }<br>
&gt;         &gt;             transport: udpu<br>
&gt;         &gt;     }<br>
&gt;         &gt;     amf {<br>
&gt;         &gt;     mode: disabled<br>
&gt;         &gt;     }<br>
&gt;         &gt;     quorum {<br>
&gt;         &gt;     provider: corosync_votequorum<br>
&gt;         &gt;     expected_votes: 1<br>
&gt;         &gt;     }<br>
&gt;         &gt;     aisexec {<br>
&gt;         &gt;             user:   root<br>
&gt;         &gt;             group:  root<br>
&gt;         &gt;     }<br>
&gt;         &gt;     logging {<br>
&gt;         &gt;             fileline: off<br>
&gt;         &gt;             to_stderr: yes<br>
&gt;         &gt;             to_logfile: no<br>
&gt;         &gt;             to_syslog: yes<br>
&gt;         &gt;     syslog_facility: daemon<br>
&gt;         &gt;             debug: off<br>
&gt;         &gt;             timestamp: on<br>
&gt;         &gt;             logger_subsys {<br>
&gt;         &gt;                     subsys: AMF<br>
&gt;         &gt;                     debug: off<br>
&gt;         &gt;                     tags:<br>
&gt;         enter|leave|trace1|trace2|trace3|trace4|trace6<br>
&gt;         &gt;             }<br>
&gt;         &gt;     }<br>
&gt;         &gt;<br>
&gt;         &gt;     # /etc/corosync/service.d/pcmk:<br>
&gt;         &gt;     service {<br>
&gt;         &gt;       name: pacemaker<br>
&gt;         &gt;       ver: 1<br>
&gt;         &gt;     }<br>
&gt;         &gt;<br>
&gt;         &gt;     /etc/drbd.d/global_common.conf:<br>
&gt;         &gt;     global {<br>
&gt;         &gt;     usage-count no;<br>
&gt;         &gt;     }<br>
&gt;         &gt;<br>
&gt;         &gt;     common {<br>
&gt;         &gt;     net {<br>
&gt;         &gt;                     protocol C;<br>
&gt;         &gt;     }<br>
&gt;         &gt;     }<br>
&gt;         &gt;<br>
&gt;         &gt;     # /etc/drbd.d/pg.res:<br>
&gt;         &gt;     resource pg {<br>
&gt;         &gt;       device /dev/drbd0;<br>
&gt;         &gt;       disk /dev/vdb;<br>
&gt;         &gt;       meta-disk internal;<br>
&gt;         &gt;       startup {<br>
&gt;         &gt;         wfc-timeout 15;<br>
&gt;         &gt;         degr-wfc-timeout 60;<br>
&gt;         &gt;       }<br>
&gt;         &gt;       disk {<br>
&gt;         &gt;         on-io-error detach;<br>
&gt;         &gt;         resync-rate 40M;<br>
&gt;         &gt;       }<br>
&gt;         &gt;       on node01 {<br>
&gt;         &gt;         address <a href="http://10.2.136.56:7789" target="_blank">10.2.136.56:7789</a> &lt;<a href="http://10.2.136.56:7789" target="_blank">http://10.2.136.56:7789</a>&gt;<br>
&gt;         &lt;<a href="http://10.2.136.56:7789" target="_blank">http://10.2.136.56:7789</a>&gt;;<br>
&gt;         &gt;       }<br>
&gt;         &gt;       on node02 {<br>
&gt;         &gt;         address <a href="http://10.2.136.57:7789" target="_blank">10.2.136.57:7789</a> &lt;<a href="http://10.2.136.57:7789" target="_blank">http://10.2.136.57:7789</a>&gt;<br>
&gt;         &lt;<a href="http://10.2.136.57:7789" target="_blank">http://10.2.136.57:7789</a>&gt;;<br>
&gt;         &gt;       }<br>
&gt;         &gt;       net {<br>
&gt;         &gt;         verify-alg md5;<br>
&gt;         &gt;         after-sb-0pri discard-zero-changes;<br>
&gt;         &gt;         after-sb-1pri discard-secondary;<br>
&gt;         &gt;         after-sb-2pri disconnect;<br>
&gt;         &gt;       }<br>
&gt;         &gt;     }<br>
&gt;         &gt;<br>
&gt;         &gt;     # Pacemaker configuration:<br>
&gt;         &gt;     node $id=&quot;167938104&quot; node01<br>
&gt;         &gt;     node $id=&quot;167938105&quot; node02<br>
&gt;         &gt;     primitive drbd_pg ocf:linbit:drbd \<br>
&gt;         &gt;     params drbd_resource=&quot;pg&quot; \<br>
&gt;         &gt;     op monitor interval=&quot;29s&quot; role=&quot;Master&quot; \<br>
&gt;         &gt;     op monitor interval=&quot;31s&quot; role=&quot;Slave&quot;<br>
&gt;         &gt;     primitive fs_pg ocf:heartbeat:Filesystem \<br>
&gt;         &gt;     params device=&quot;/dev/drbd0&quot;<br>
&gt;         directory=&quot;/var/lib/postgresql/9.3/main&quot;<br>
&gt;         &gt;     fstype=&quot;ext4&quot;<br>
&gt;         &gt;     primitive ip_pg ocf:heartbeat:IPaddr2 \<br>
&gt;         &gt;     params ip=&quot;10.2.136.59&quot; cidr_netmask=&quot;24&quot; nic=&quot;eth0&quot;<br>
&gt;         &gt;     primitive lsb_pg lsb:postgresql<br>
&gt;         &gt;     group PGServer fs_pg lsb_pg ip_pg<br>
&gt;         &gt;     ms ms_drbd_pg drbd_pg \<br>
&gt;         &gt;     meta master-max=&quot;1&quot; master-node-max=&quot;1&quot; clone-max=&quot;2&quot;<br>
&gt;         &gt;     clone-node-max=&quot;1&quot; notify=&quot;true&quot;<br>
&gt;         &gt;     colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master<br>
&gt;         &gt;     order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start<br>
&gt;         &gt;     property $id=&quot;cib-bootstrap-options&quot; \<br>
&gt;         &gt;     dc-version=&quot;1.1.10-42f2063&quot; \<br>
&gt;         &gt;     cluster-infrastructure=&quot;corosync&quot; \<br>
&gt;         &gt;     stonith-enabled=&quot;false&quot; \<br>
&gt;         &gt;     no-quorum-policy=&quot;ignore&quot;<br>
&gt;         &gt;     rsc_defaults $id=&quot;rsc-options&quot; \<br>
&gt;         &gt;     resource-stickiness=&quot;100&quot;<br>
&gt;         &gt;<br>
&gt;         &gt;     # Logs on node01<br>
&gt;         &gt;     Sep 10 10:25:33 node01 crmd[1019]:   notice:<br>
&gt;         peer_update_callback:<br>
&gt;         &gt;     Our peer on the DC is dead<br>
&gt;         &gt;     Sep 10 10:25:33 node01 crmd[1019]:   notice:<br>
&gt;         do_state_transition:<br>
&gt;         &gt;     State transition S_NOT_DC -&gt; S_ELECTION [ input=I_ELECTION<br>
&gt;         &gt;     cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]<br>
&gt;         &gt;     Sep 10 10:25:33 node01 crmd[1019]:   notice:<br>
&gt;         do_state_transition:<br>
&gt;         &gt;     State transition S_ELECTION -&gt; S_INTEGRATION [<br>
&gt;         input=I_ELECTION_DC<br>
&gt;         &gt;     cause=C_FSA_INTERNAL origin=do_election_check ]<br>
&gt;         &gt;     Sep 10 10:25:33 node01 corosync[940]:   [TOTEM ] A new<br>
&gt;         membership<br>
&gt;         &gt;     (<a href="http://10.2.136.56:52" target="_blank">10.2.136.56:52</a> &lt;<a href="http://10.2.136.56:52" target="_blank">http://10.2.136.56:52</a>&gt;<br>
&gt;         &lt;<a href="http://10.2.136.56:52" target="_blank">http://10.2.136.56:52</a>&gt;) was formed. Members left:<br>
&gt;         &gt;     167938105<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg:<br>
&gt;         PingAck did<br>
&gt;         &gt;     not arrive in time.<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer(<br>
&gt;         &gt;     Primary -&gt; Unknown ) conn( Connected -&gt; NetworkFailure ) pdsk(<br>
&gt;         &gt;     UpToDate -&gt; DUnknown )<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg:<br>
&gt;         asender<br>
&gt;         &gt;     terminated<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg:<br>
&gt;         Terminating<br>
&gt;         &gt;     drbd_a_pg<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg:<br>
&gt;         Connection<br>
&gt;         &gt;     closed<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn(<br>
&gt;         &gt;     NetworkFailure -&gt; Unconnected )<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg:<br>
&gt;         receiver<br>
&gt;         &gt;     terminated<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg:<br>
&gt;         Restarting<br>
&gt;         &gt;     receiver thread<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg:<br>
&gt;         receiver<br>
&gt;         &gt;     (re)started<br>
&gt;         &gt;     Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn(<br>
&gt;         &gt;     Unconnected -&gt; WFConnection )<br>
&gt;         &gt;     Sep 10 10:26:12 node01 lrmd[1016]:  warning:<br>
&gt;         child_timeout_callback:<br>
&gt;         &gt;     drbd_pg_monitor_31000 process (PID 8445) timed out<br>
&gt;         &gt;     Sep 10 10:26:12 node01 lrmd[1016]:  warning:<br>
&gt;         operation_finished:<br>
&gt;         &gt;     drbd_pg_monitor_31000:8445 - timed out after 20000ms<br>
&gt;         &gt;     Sep 10 10:26:12 node01 crmd[1019]:    error:<br>
&gt;         process_lrm_event: LRM<br>
&gt;         &gt;     operation drbd_pg_monitor_31000 (30) Timed Out<br>
&gt;         (timeout=20000ms)<br>
&gt;         &gt;     Sep 10 10:26:32 node01 crmd[1019]:  warning: cib_rsc_callback:<br>
&gt;         &gt;     Resource update 23 failed: (rc=-62) Timer expired<br>
&gt;         &gt;     Sep 10 10:27:03 node01 lrmd[1016]:  warning:<br>
&gt;         child_timeout_callback:<br>
&gt;         &gt;     drbd_pg_monitor_31000 process (PID 8693) timed out<br>
&gt;         &gt;     Sep 10 10:27:03 node01 lrmd[1016]:  warning:<br>
&gt;         operation_finished:<br>
&gt;         &gt;     drbd_pg_monitor_31000:8693 - timed out after 20000ms<br>
&gt;         &gt;     Sep 10 10:27:54 node01 lrmd[1016]:  warning:<br>
&gt;         child_timeout_callback:<br>
&gt;         &gt;     drbd_pg_monitor_31000 process (PID 8938) timed out<br>
&gt;         &gt;     Sep 10 10:27:54 node01 lrmd[1016]:  warning:<br>
&gt;         operation_finished:<br>
&gt;         &gt;     drbd_pg_monitor_31000:8938 - timed out after 20000ms<br>
&gt;         &gt;     Sep 10 10:28:33 node01 crmd[1019]:    error: crm_timer_popped:<br>
&gt;         &gt;     Integration Timer (I_INTEGRATED) just popped in state<br>
&gt;         S_INTEGRATION!<br>
&gt;         &gt;     (180000ms)<br>
&gt;         &gt;     Sep 10 10:28:33 node01 crmd[1019]:  warning:<br>
&gt;         do_state_transition:<br>
&gt;         &gt;     Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED<br>
&gt;         &gt;     Sep 10 10:28:33 node01 crmd[1019]:  warning:<br>
&gt;         do_state_transition: 1<br>
&gt;         &gt;     cluster nodes failed to respond to the join offer.<br>
&gt;         &gt;     Sep 10 10:28:33 node01 crmd[1019]:   notice:<br>
&gt;         crmd_join_phase_log:<br>
&gt;         &gt;     join-1: node02=none<br>
&gt;         &gt;     Sep 10 10:28:33 node01 crmd[1019]:   notice:<br>
&gt;         crmd_join_phase_log:<br>
&gt;         &gt;     join-1: node01=welcomed<br>
&gt;         &gt;     Sep 10 10:28:45 node01 lrmd[1016]:  warning:<br>
&gt;         child_timeout_callback:<br>
&gt;         &gt;     drbd_pg_monitor_31000 process (PID 9185) timed out<br>
&gt;         &gt;     Sep 10 10:28:45 node01 lrmd[1016]:  warning:<br>
&gt;         operation_finished:<br>
&gt;         &gt;     drbd_pg_monitor_31000:9185 - timed out after 20000ms<br>
&gt;         &gt;     Sep 10 10:29:36 node01 lrmd[1016]:  warning:<br>
&gt;         child_timeout_callback:<br>
&gt;         &gt;     drbd_pg_monitor_31000 process (PID 9432) timed out<br>
&gt;         &gt;     Sep 10 10:29:36 node01 lrmd[1016]:  warning:<br>
&gt;         operation_finished:<br>
&gt;         &gt;     drbd_pg_monitor_31000:9432 - timed out after 20000ms<br>
&gt;         &gt;     Sep 10 10:30:27 node01 lrmd[1016]:  warning:<br>
&gt;         child_timeout_callback:<br>
&gt;         &gt;     drbd_pg_monitor_31000 process (PID 9680) timed out<br>
&gt;         &gt;     Sep 10 10:30:27 node01 lrmd[1016]:  warning:<br>
&gt;         operation_finished:<br>
&gt;         &gt;     drbd_pg_monitor_31000:9680 - timed out after 20000ms<br>
&gt;         &gt;     Sep 10 10:31:18 node01 lrmd[1016]:  warning:<br>
&gt;         child_timeout_callback:<br>
&gt;         &gt;     drbd_pg_monitor_31000 process (PID 9927) timed out<br>
&gt;         &gt;     Sep 10 10:31:18 node01 lrmd[1016]:  warning:<br>
&gt;         operation_finished:<br>
&gt;         &gt;     drbd_pg_monitor_31000:9927 - timed out after 20000ms<br>
&gt;         &gt;     Sep 10 10:32:09 node01 lrmd[1016]:  warning:<br>
&gt;         child_timeout_callback:<br>
&gt;         &gt;     drbd_pg_monitor_31000 process (PID 10174) timed out<br>
&gt;         &gt;     Sep 10 10:32:09 node01 lrmd[1016]:  warning:<br>
&gt;         operation_finished:<br>
&gt;         &gt;     drbd_pg_monitor_31000:10174 - timed out after 20000ms<br>
&gt;         &gt;<br>
&gt;         &gt;     #crm_mon on node01 before I kill the other vm:<br>
&gt;         &gt;     Stack: corosync<br>
&gt;         &gt;     Current DC: node02 (167938104) - partition with quorum<br>
&gt;         &gt;     Version: 1.1.10-42f2063<br>
&gt;         &gt;     2 Nodes configured<br>
&gt;         &gt;     5 Resources configured<br>
&gt;         &gt;<br>
&gt;         &gt;     Online: [ node01 node02 ]<br>
&gt;         &gt;<br>
&gt;         &gt;      Resource Group: PGServer<br>
&gt;         &gt;          fs_pg      (ocf::heartbeat:Filesystem):    Started node02<br>
&gt;         &gt;          lsb_pg     (lsb:postgresql):       Started node02<br>
&gt;         &gt;          ip_pg      (ocf::heartbeat:IPaddr2):       Started node02<br>
&gt;         &gt;      Master/Slave Set: ms_drbd_pg [drbd_pg]<br>
&gt;         &gt;          Masters: [ node02 ]<br>
&gt;         &gt;          Slaves: [ node01 ]<br>
&gt;         &gt;<br>
&gt;         &gt;     Thank you,<br>
&gt;         &gt;     Kiam<br>
&gt;         &gt;<br>
&gt;         &gt;<br>
&gt;         &gt;<br>
&gt;         &gt;<br>
<br>
<br>
_______________________________________________<br>
Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>
</blockquote></div><br></div>