[Pacemaker] One node thinks everyone is online, but the other node doesn't think so

Dmitry Koterov dmitry.koterov@gmail.com
Sun Jan 25 18:53:52 EST 2015


Hello.

I have a 3-node cluster where node1 and node2 run corosync+pacemaker and
node3 runs corosync only (for quorum). Corosync is 2.3.3, pacemaker is
1.1.10. Everything worked fine for the first couple of days.
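
For context, the quorum setup is the standard corosync votequorum
configuration; from memory, the relevant part of corosync.conf on all three
nodes looks roughly like this (the exact option set may differ):

quorum {
        provider: corosync_votequorum
        expected_votes: 3
}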

At some point I discovered the following situation: node2 thinks that both
node1 and node2 are online, but node1 thinks that node2 is down. How can
this happen? There are no connectivity problems between the nodes at the
moment (maybe there were earlier, but why hasn't the cluster recovered on
its own?). The "crm status" output from both nodes is below. What other
logs should I attach for diagnosis?
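
If a consolidated report is more useful, I believe I can generate one with
crm_report for the time window of the incident, something like this
(options quoted from memory):

root@node1:~# crm_report -f "2015-01-22 09:00" -t "2015-01-22 11:00" /tmp/cluster-report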

Also, "service corosync stop" on node1 hangs forever with no additional
lines in logs, so I cannot even stop the service. (But after "service
corosync stop" on node1 the node node2 starts thinking that node1 is
offline, although the command still hangs.)
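
In case it helps, I can also collect corosync's own view of membership and
ring status on both nodes; as far as I know, these commands only read
runtime state:

root@node1:~# corosync-quorumtool -s             # quorum and vote summary
root@node1:~# corosync-cfgtool -s                # ring/interface status
root@node1:~# corosync-cmapctl | grep members    # current member list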


root@node2:~# crm status
Current DC: node1 (1760315215) - partition with quorum
2 Nodes configured
6 Resources configured
Online: [ node1 node2 ]
Master/Slave Set: ms_drbd [drbd]
     Masters: [ node2 ]
     Slaves: [ node1 ]
Resource Group: server
     fs (ocf::heartbeat:Filesystem): Started node2
     postgresql (lsb:postgresql): Started node2
     bind9 (lsb:bind9): Started node2
     nginx (lsb:nginx): Started node2


root@node1:/var/log# crm status
Current DC: node1 (1760315215) - partition with quorum
2 Nodes configured
6 Resources configured
Online: [ node1 ]
OFFLINE: [ node2 ]
Master/Slave Set: ms_drbd [drbd]
     Masters: [ node1 ]
     Stopped: [ node2 ]
Resource Group: server
     fs (ocf::heartbeat:Filesystem): Started node1
     postgresql (lsb:postgresql): Started node1
     bind9 (lsb:bind9): Started node1
     nginx (lsb:nginx): Started node1
Failed actions:
    drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out, last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown error


Part of the "crm configure show" output:

property $id="cib-bootstrap-options" \
        dc-version="1.1.10-42f2063" \
        cluster-infrastructure="corosync" \
        stonith-enabled="false" \
        last-lrm-refresh="1421250983"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"
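
(I know stonith-enabled="false" is discouraged. If missing fencing turns
out to be the root cause, I assume enabling it would look roughly like the
following, with external/ssh only as a placeholder test device; I have not
configured real fencing yet:)

root@node1:~# crm configure primitive st-ssh stonith:external/ssh params hostlist="node1 node2"
root@node1:~# crm configure property stonith-enabled=true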


I also see the following in the logs on node1 (maybe related to the issue, maybe not):

Jan 22 10:14:02 node1 pengine[2772]:  warning: pe_fence_node: Node node2 is
unclean because it is partially and/or un-expectedly down
Jan 22 10:14:02 node1 pengine[2772]:  warning: determine_online_status:
Node node2 is unclean
Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: Node node2 is
unclean!
Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: YOUR RESOURCES ARE
NOW LIKELY COMPROMISED
Jan 22 10:14:02 node1 pengine[2772]:    error: stage6: ENABLE STONITH TO
KEEP YOUR RESOURCES SAFE


On node2 the logs are:

Jan 22 10:13:57 node2 corosync[32761]:  [TOTEM ] A new membership (
188.166.54.190:6276) was formed. Members left: 1760315215 13071578
Jan 22 10:13:57 node2 crmd[311]:   notice: peer_update_callback: Our peer
on the DC is dead
Jan 22 10:13:57 node2 crmd[311]:   notice: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] This node is within the
non-primary component and will NOT provide any services.
Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] Members[1]: 1017525950
Jan 22 10:13:57 node2 crmd[311]:   notice: pcmk_quorum_notification:
Membership 6276: quorum lost (1)
Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was
member)
Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was
member)
Jan 22 10:13:57 node2 pacemakerd[302]:   notice: pcmk_quorum_notification:
Membership 6276: quorum lost (1)
Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was
member)
Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was
member)
Jan 22 10:13:57 node2 corosync[32761]:  [MAIN  ] Completed service
synchronization, ready to provide service.
Jan 22 10:14:01 node2 corosync[32761]:  [TOTEM ] A new membership (
104.236.71.79:6288) was formed. Members joined: 1760315215 13071578
Jan 22 10:14:02 node2 crmd[311]:    error: pcmk_cpg_membership: Node
node1[1760315215] appears to be online even though we think it is dead
Jan 22 10:14:02 node2 crmd[311]:   notice: crm_update_peer_state:
pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)
Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] This node is within the
primary component and will provide service.
Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215
13071578 1017525950
Jan 22 10:14:03 node2 crmd[311]:   notice: pcmk_quorum_notification:
Membership 6288: quorum acquired (3)
Jan 22 10:14:03 node2 pacemakerd[302]:   notice: pcmk_quorum_notification:
Membership 6288: quorum acquired (3)
Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node node1[1760315215] - state is now member (was
lost)
Jan 22 10:14:03 node2 corosync[32761]:  [MAIN  ] Completed service
synchronization, ready to provide service.
Jan 22 10:14:03 node2 crmd[311]:   notice: corosync_node_name: Unable to
get node name for nodeid 13071578
Jan 22 10:14:03 node2 crmd[311]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node (null)[13071578] - state is now member (was
lost)
Jan 22 10:14:03 node2 pacemakerd[302]:   notice: corosync_node_name: Unable
to get node name for nodeid 13071578
Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node (null)[13071578] - state is now member (was
lost)
Jan 22 10:14:03 node2 crmd[311]:  warning: do_log: FSA: Input I_JOIN_OFFER
from route_message() received in state S_ELECTION
Jan 22 10:14:04 node2 crmd[311]:   notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_local_callback: Sending
full refresh (origin=crmd)
Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: master-drbd (10000)
Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: probe_complete (true)
Jan 22 10:14:05 node2 crmd[311]:   notice: do_state_transition: State
transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE
origin=do_cl_join_finalize_respond ]
Jan 22 10:15:11 node2 corosync[32761]:  [TOTEM ] A new membership (
104.236.71.79:6296) was formed. Members left: 13071578
Jan 22 10:15:14 node2 corosync[32761]:  [TOTEM ] A new membership (
128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215
Jan 22 10:15:17 node2 corosync[32761]:  [TOTEM ] A new membership (
104.236.71.79:6324) was formed. Members joined: 1760315215
Jan 22 10:15:19 node2 crmd[311]:   notice: peer_update_callback: Our peer
on the DC is dead
Jan 22 10:15:19 node2 crmd[311]:   notice: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary ->
Secondary )
Jan 22 10:15:20 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215
13071578 1017525950
Jan 22 10:15:20 node2 corosync[32761]:  [MAIN  ] Completed service
synchronization, ready to provide service.