<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Feb 24, 2015 at 2:07 AM, Andrew Beekhof <span dir="ltr">&lt;<a href="mailto:andrew@beekhof.net" target="_blank">andrew@beekhof.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class=""><br>&gt; I have a 3-node cluster where node1 and node2 are running corosync+pacemaker and node3 is running corosync only (for quorum). Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine the first couple of days.<br>

&gt;<br>

&gt; At some point I discovered the following situation: node2 thinks that both node1 and node2 are online, but node1 thinks that node2 is down. How can this happen? There are no connectivity problems between the nodes at the moment (maybe there were earlier, but why hasn&#39;t the system recovered?).<br>

<br>

</span>The logs show connectivity problems occurring, so no doubt there.<br>

As to why it hasn&#39;t recovered, first check corosync: if it does not have a consistent view of the world, pacemaker has no hope.<br>

Alternatively, I recall there was a bug that could be preventing this in your version.  So if corosync looks fine, perhaps try an upgrade.</blockquote><div><br></div><div>Thanks.</div><div>Are you talking about this bug: <a href="https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496">https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496</a> ?</div><div><br></div><div>I believe I reproduced the problem once more (it occurs very intermittently). The symptoms were as follows:</div><div><br></div><div>1. At some point node2 went down.</div><div>2. The last message from corosync on node1 was &quot;Quorum lost&quot; (I suspect a temporary loss of connectivity to node3).</div><div>3. Then, a couple of days later, &quot;service corosync stop&quot; hung on node3 (only killall -9 helped). I ran strace while the service was stopping; it shows:</div><div><div><br></div><div>[pid 19449] futex(0x7f580b4c62e0, FUTEX_WAIT_PRIVATE, 0, NULL &lt;unfinished ...&gt;</div><div>[pid 19448] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=28183, si_uid=0} ---</div><div>[pid 19448] write(6, &quot;\3\0\0\0&quot;, 4)     = 4</div><div>[pid 19448] rt_sigreturn()              = 360</div></div><div>... &lt;and this repeats for pid 19448 again and again&gt;</div><div><br></div><div>where pstree shows:</div><div><br></div><div><div>init,1</div><div>  ├─corosync,19448<br></div><div>  │   └─{corosync},19449</div></div><div><br></div><div>4. The same happens on node1: &quot;service corosync stop&quot; hangs with the same symptoms, and only killall -9 helps.</div><div>5. Restarting corosync &amp; pacemaker on node1 and node2 solved the problem.</div><div><br></div><div>Could you please tell me whether this is related to the libqb bug above?</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div class="h5">

&gt; The &quot;crm status&quot; is below. What other logs should I attach for the diagnostics?<br>

&gt;<br>

&gt; Also, &quot;service corosync stop&quot; on node1 hangs forever with no additional lines in the logs, so I cannot even stop the service. (But after running &quot;service corosync stop&quot; on node1, node2 starts thinking that node1 is offline, although the command still hangs.)<br>

&gt;<br>

&gt;<br>

&gt; root@node2:~# crm status<br>

&gt; Current DC: node1 (1760315215) - partition with quorum<br>

&gt; 2 Nodes configured<br>

&gt; 6 Resources configured<br>

&gt; Online: [ node1 node2 ]<br>

&gt; Master/Slave Set: ms_drbd [drbd]<br>

&gt;      Masters: [ node2 ]<br>

&gt;      Slaves: [ node1 ]<br>

&gt; Resource Group: server<br>

&gt;      fs       (ocf::heartbeat:Filesystem):    Started node2<br>

&gt;      postgresql       (lsb:postgresql):       Started node2<br>

&gt;      bind9    (lsb:bind9):    Started node2<br>

&gt;      nginx    (lsb:nginx):    Started node2<br>

&gt;<br>

&gt;<br>

&gt; root@node1:/var/log# crm status<br>

&gt; Current DC: node1 (1760315215) - partition with quorum<br>

&gt; 2 Nodes configured<br>

&gt; 6 Resources configured<br>

&gt; Online: [ node1 ]<br>

&gt; OFFLINE: [ node2 ]<br>

&gt; Master/Slave Set: ms_drbd [drbd]<br>

&gt;      Masters: [ node1 ]<br>

&gt;      Stopped: [ node2 ]<br>

&gt; Resource Group: server<br>

&gt;      fs       (ocf::heartbeat:Filesystem):    Started node1<br>

&gt;      postgresql       (lsb:postgresql):       Started node1<br>

&gt;      bind9    (lsb:bind9):    Started node1<br>

&gt;      nginx    (lsb:nginx):    Started node1<br>

&gt; Failed actions:<br>

&gt;     drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out, last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown error<br>

&gt;<br>

&gt;<br>

&gt; A part of &quot;crm configure show&quot;:<br>

&gt;<br>

&gt; property $id=&quot;cib-bootstrap-options&quot; \<br>

&gt;         dc-version=&quot;1.1.10-42f2063&quot; \<br>

&gt;         cluster-infrastructure=&quot;corosync&quot; \<br>

&gt;         stonith-enabled=&quot;false&quot; \<br>

&gt;         last-lrm-refresh=&quot;1421250983&quot;<br>

&gt; rsc_defaults $id=&quot;rsc-options&quot; \<br>

&gt;         resource-stickiness=&quot;100&quot;<br>

&gt;<br>

&gt;<br>

&gt; Also I see in logs on node1 (maybe they&#39;re related to the issue, maybe not):<br>

&gt;<br>

&gt; Jan 22 10:14:02 node1 pengine[2772]:  warning: pe_fence_node: Node node2 is unclean because it is partially and/or un-expectedly down<br>

&gt; Jan 22 10:14:02 node1 pengine[2772]:  warning: determine_online_status: Node node2 is unclean<br>

&gt; Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: Node node2 is unclean!<br>

&gt; Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED<br>

&gt; Jan 22 10:14:02 node1 pengine[2772]:    error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE<br>

&gt;<br>

&gt;<br>

&gt; On node2 the logs are:<br>

&gt;<br>

&gt; Jan 22 10:13:57 node2 corosync[32761]:  [TOTEM ] A new membership (<a href="http://188.166.54.190:6276" target="_blank">188.166.54.190:6276</a>) was formed. Members left: 1760315215 13071578<br>

&gt; Jan 22 10:13:57 node2 crmd[311]:   notice: peer_update_callback: Our peer on the DC is dead<br>

&gt; Jan 22 10:13:57 node2 crmd[311]:   notice: do_state_transition: State transition S_NOT_DC -&gt; S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]<br>

&gt; Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.<br>

&gt; Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] Members[1]: 1017525950<br>

&gt; Jan 22 10:13:57 node2 crmd[311]:   notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)<br>

&gt; Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)<br>

&gt; Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)<br>

&gt; Jan 22 10:13:57 node2 pacemakerd[302]:   notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)<br>

&gt; Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)<br>

&gt; Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)<br>

&gt; Jan 22 10:13:57 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.<br>

&gt; Jan 22 10:14:01 node2 corosync[32761]:  [TOTEM ] A new membership (<a href="http://104.236.71.79:6288" target="_blank">104.236.71.79:6288</a>) was formed. Members joined: 1760315215 13071578<br>

&gt; Jan 22 10:14:02 node2 crmd[311]:    error: pcmk_cpg_membership: Node node1[1760315215] appears to be online even though we think it is dead<br>

&gt; Jan 22 10:14:02 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)<br>

&gt; Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] This node is within the primary component and will provide service.<br>

&gt; Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215 13071578 1017525950<br>

&gt; Jan 22 10:14:03 node2 crmd[311]:   notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)<br>

&gt; Jan 22 10:14:03 node2 pacemakerd[302]:   notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)<br>

&gt; Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now member (was lost)<br>

&gt; Jan 22 10:14:03 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.<br>

&gt; Jan 22 10:14:03 node2 crmd[311]:   notice: corosync_node_name: Unable to get node name for nodeid 13071578<br>

&gt; Jan 22 10:14:03 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)<br>

&gt; Jan 22 10:14:03 node2 pacemakerd[302]:   notice: corosync_node_name: Unable to get node name for nodeid 13071578<br>

&gt; Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)<br>

&gt; Jan 22 10:14:03 node2 crmd[311]:  warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION<br>

&gt; Jan 22 10:14:04 node2 crmd[311]:   notice: do_state_transition: State transition S_ELECTION -&gt; S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]<br>

&gt; Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)<br>

&gt; Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd (10000)<br>

&gt; Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)<br>

&gt; Jan 22 10:14:05 node2 crmd[311]:   notice: do_state_transition: State transition S_PENDING -&gt; S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]<br>

&gt; Jan 22 10:15:11 node2 corosync[32761]:  [TOTEM ] A new membership (<a href="http://104.236.71.79:6296" target="_blank">104.236.71.79:6296</a>) was formed. Members left: 13071578<br>

&gt; Jan 22 10:15:14 node2 corosync[32761]:  [TOTEM ] A new membership (<a href="http://128.199.116.218:6312" target="_blank">128.199.116.218:6312</a>) was formed. Members joined: 13071578 left: 1760315215<br>

&gt; Jan 22 10:15:17 node2 corosync[32761]:  [TOTEM ] A new membership (<a href="http://104.236.71.79:6324" target="_blank">104.236.71.79:6324</a>) was formed. Members joined: 1760315215<br>

&gt; Jan 22 10:15:19 node2 crmd[311]:   notice: peer_update_callback: Our peer on the DC is dead<br>

&gt; Jan 22 10:15:19 node2 crmd[311]:   notice: do_state_transition: State transition S_NOT_DC -&gt; S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]<br>

&gt; Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary -&gt; Secondary )<br>

&gt; Jan 22 10:15:20 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215 13071578 1017525950<br>

&gt; Jan 22 10:15:20 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.<br>

&gt;<br>

</div></div>&gt; _______________________________________________<br>

&gt; Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>

&gt; <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

&gt;<br>

&gt; Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>

&gt; Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

&gt; Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>

<br>

<br>


</blockquote></div><br></div></div>