[ClusterLabs] Previous DC fenced prior to integration

Nate Clark nate at neworld.us
Mon Jul 25 14:48:08 EDT 2016


On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 07/23/2016 10:14 PM, Nate Clark wrote:
>> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>> On 23.07.2016 01:37, Nate Clark wrote:
>>>> Hello,
>>>>
>>>> I am running pacemaker 1.1.13 with corosync and think I may have
>>>> encountered a start-up timing issue on a two-node cluster. I didn't
>>>> notice anything similar to this in the changelogs for 1.1.14 or
>>>> 1.1.15, or in the open bugs.
>>>>
>>>> The rough outline of what happened:
>>>>
>>>> Module 1 and 2 running
>>>> Module 1 is DC
>>>> Module 2 shuts down
>>>> Module 1 updates node attributes used by resources
>>>> Module 1 shuts down
>>>> Module 2 starts up
>>>> Module 2 votes itself as DC
>>>> Module 1 starts up
>>>> Module 2 sees module 1 in corosync and notices it has quorum
>>>> Module 2 enters policy engine state.
>>>> Module 2 policy engine decides to fence 1
>>>> Module 2 then continues and starts resources on itself based upon the old state
>>>>
>>>> For some reason the integration never occurred, and module 2 started
>>>> to perform actions based on stale state.
>>>>
>>>> Here are the full logs:
>>>> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
>>>> cluster infrastructure: corosync
>>>> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
>>>> obtain a node name for corosync nodeid 2
>>>> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
>>>> uname -n for the local corosync node name
>>>> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
>>>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
>>>> for stonith topology changes
>>>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
>>>> 'watchdog' to the device list (1 active devices)
>>>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
>>>> on watchdog integration for fencing
>>>> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
>>>> uname -n for the local corosync node name
>>>> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
>>>> pcmk_quorum_notification: Node module-2[2] - state is now member (was
>>>> (null))
>>>> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
>>>> uname -n for the local corosync node name
>>>> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications disabled
>>>> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
>>>> enabled but stonith-watchdog-timeout is disabled
>>>> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
>>>> is operational
>>>> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
>>>> transition S_STARTING -> S_PENDING [ input=I_PENDING
>>>> cause=C_FSA_INTERNAL origin=do_started ]
>>>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
>>>> 'fence_sbd' to the device list (2 active devices)
>>>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
>>>> 'ipmi-1' to the device list (3 active devices)
>>>> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
>>>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>>> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
>>>> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>>>> cause=C_TIMER_POPPED origin=election_timeout_popped ]
>>>> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
>>>> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>>>> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications disabled
>>>> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
>>>> enabled but stonith-watchdog-timeout is disabled
>>>> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
>>>> uname -n for the local corosync node name
>>>> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
>>>> watchdog integration for fencing
>>>> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
>>>> have quorum - fencing and resource management disabled
>>>> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
>>>> module-1 is unclean!
>>>> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
>>>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>>>> ignore)
>
> The above two messages indicate that module-2 cannot see module-1 at
> startup, therefore it must assume it is potentially misbehaving and must
> be shot. However, since it does not have quorum with only one out of two
> nodes, it must wait until module-1 joins before it can shoot it!
>
> This is a special problem with quorum in a two-node cluster. There are a
> variety of ways to deal with it, but the simplest is to set "two_node:
> 1" in corosync.conf (with corosync 2 or later). This will make each node
> wait for the other at startup, meaning both nodes must be started before
> the cluster can start working, but from that point on, it will assume it
> has quorum, and use fencing to ensure any lost node is really down.

two_node is set to 1 for this system. I understand what you are saying,
but what usually happens is that S_INTEGRATION occurs after quorum is
achieved: the current DC acknowledges the node that just started and
then accepts it into the cluster. However, it looks like
S_POLICY_ENGINE occurred first here.

Here is what usually happens:

Jul 22 18:42:48 d25-22-left corosync[17646]:   [TOTEM ] A new
membership (172.20.71.72:84) was formed. Members joined: 2
Jul 22 18:42:48 d25-22-left corosync[17646]:   [VOTEQ ] Waiting for
all cluster members. Current votes: 1 expected_votes: 2
Jul 22 18:42:48 d25-22-left corosync[17646]:   [QUORUM] This node is
within the primary component and will provide service.
Jul 22 18:42:48 d25-22-left corosync[17646]:   [QUORUM] Members[2]: 1 2
Jul 22 18:42:48 d25-22-left corosync[17646]:   [MAIN  ] Completed
service synchronization, ready to provide service.
Jul 22 18:42:48 d25-22-left crmd[17714]:   notice: Membership 84:
quorum acquired (2)
Jul 22 18:42:48 d25-22-left pacemakerd[17687]:   notice: Membership
84: quorum acquired (2)
Jul 22 18:42:48 d25-22-left crmd[17714]:   notice: Could not obtain a
node name for corosync nodeid 2
Jul 22 18:42:48 d25-22-left pacemakerd[17687]:   notice: Could not
obtain a node name for corosync nodeid 2
Jul 22 18:42:48 d25-22-left crmd[17714]:   notice: Could not obtain a
node name for corosync nodeid 2
Jul 22 18:42:48 d25-22-left crmd[17714]:   notice:
pcmk_quorum_notification: Node (null)[2] - state is now member (was
(null))
Jul 22 18:42:48 d25-22-left pacemakerd[17687]:   notice: Could not
obtain a node name for corosync nodeid 2
Jul 22 18:42:48 d25-22-left pacemakerd[17687]:   notice:
pcmk_quorum_notification: Node (null)[2] - state is now member (was
(null))
Jul 22 18:42:49 d25-22-left pacemakerd[17687]:   notice: Could not
obtain a node name for corosync nodeid 2
Jul 22 18:42:49 d25-22-left attrd[17712]:   notice: Could not obtain a
node name for corosync nodeid 2
Jul 22 18:42:49 d25-22-left attrd[17712]:   notice:
crm_update_peer_proc: Node (null)[2] - state is now member (was
(null))
Jul 22 18:42:49 d25-22-left stonith-ng[17710]:   notice: Could not
obtain a node name for corosync nodeid 2
Jul 22 18:42:49 d25-22-left stonith-ng[17710]:   notice:
crm_update_peer_proc: Node (null)[2] - state is now member (was
(null))
Jul 22 18:42:49 d25-22-left stonith-ng[17710]:   notice: Defaulting to
uname -n for the local corosync node name
Jul 22 18:42:49 d25-22-left cib[17709]:   notice: Could not obtain a
node name for corosync nodeid 2
Jul 22 18:42:49 d25-22-left cib[17709]:   notice:
crm_update_peer_proc: Node (null)[2] - state is now member (was
(null))
Jul 22 18:42:50 d25-22-left crmd[17714]:   notice: Could not obtain a
node name for corosync nodeid 2
Jul 22 18:42:51 d25-22-left crmd[17714]:   notice: State transition
S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_HA_MESSAGE
origin=route_message ]

In this case the node which came up was not fenced and instead joined
the cluster, and since it had actually been the DC previously, the
updated attributes and state were synced to the current DC.
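
For completeness, the attribute the aborted transition refers to
(master-postgres, from the status-2-master-postgres entry above) can be
checked on either node with something along these lines; the exact
attribute name is simply what our resource agent happens to use:

    # query the transient (reboot-lifetime) attribute for a given node
    crm_attribute --node module-1 --name master-postgres \
        --lifetime reboot --query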

>>>> Jul 20 16:29:27.503521 module-2 pengine[21968]:   notice: Start
>>>> fence_sbd        (module-2 - blocked)
>>>> Jul 20 16:29:27.503539 module-2 pengine[21968]:   notice: Start
>>>> ipmi-1        (module-2 - blocked)
>>>> Jul 20 16:29:27.503559 module-2 pengine[21968]:   notice: Start
>>>> SlaveIP        (module-2 - blocked)
>>>> Jul 20 16:29:27.503582 module-2 pengine[21968]:   notice: Start
>>>> postgres:0        (module-2 - blocked)
>>>> Jul 20 16:29:27.503597 module-2 pengine[21968]:   notice: Start
>>>> ethmonitor:0        (module-2 - blocked)
>>>> Jul 20 16:29:27.503618 module-2 pengine[21968]:   notice: Start
>>>> tomcat-instance:0        (module-2 - blocked)
>>>> Jul 20 16:29:27.503629 module-2 pengine[21968]:   notice: Start
>>>> ClusterMonitor:0        (module-2 - blocked)
>>>> Jul 20 16:29:27.506945 module-2 pengine[21968]:  warning: Calculated
>>>> Transition 0: /var/lib/pacemaker/pengine/pe-warn-0.bz2
>>>> Jul 20 16:29:27.507976 module-2 crmd[21969]:   notice: Initiating
>>>> action 4: monitor fence_sbd_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.509282 module-2 crmd[21969]:   notice: Initiating
>>>> action 5: monitor ipmi-1_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.511839 module-2 crmd[21969]:   notice: Initiating
>>>> action 6: monitor ipmi-2_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.512629 module-2 crmd[21969]:   notice: Initiating
>>>> action 7: monitor MasterIP_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.513340 module-2 crmd[21969]:   notice: Initiating
>>>> action 8: monitor SlaveIP_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.531174 module-2 crmd[21969]:   notice: Initiating
>>>> action 9: monitor postgres:0_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.532708 module-2 crmd[21969]:   notice: Initiating
>>>> action 10: monitor ethmonitor:0_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.534246 module-2 crmd[21969]:   notice: Initiating
>>>> action 11: monitor tomcat-instance:0_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.535713 module-2 crmd[21969]:   notice: Initiating
>>>> action 12: monitor ClusterMonitor:0_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.537077 module-2 crmd[21969]:   notice: Initiating
>>>> action 13: monitor N1F1_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:27.538388 module-2 crmd[21969]:   notice: Initiating
>>>> action 14: monitor N2F1_monitor_0 on module-2 (local)
>>>> Jul 20 16:29:28.541535 module-2 crmd[21969]:   notice: Operation
>>>> fence_sbd_monitor_0: not running (node=module-2, call=5, rc=7,
>>>> cib-update=28, confirmed=true)
>>>> Jul 20 16:29:29.543752 module-2 crmd[21969]:   notice: Operation
>>>> ipmi-1_monitor_0: not running (node=module-2, call=9, rc=7,
>>>> cib-update=29, confirmed=true)
>>>> Jul 20 16:29:29.544399 module-2 crmd[21969]:   notice: Operation
>>>> ipmi-2_monitor_0: not running (node=module-2, call=13, rc=7,
>>>> cib-update=30, confirmed=true)
>>>> Jul 20 16:29:29.544723 module-2 crmd[21969]:   notice: Transition
>>>> aborted by status-2-master-postgres, master-postgres=-INFINITY:
>>>> Transient attribute change (create cib=1.49.6,
>>>> source=abort_unless_down:319,
>>>> path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2'],
>>>> 0)
>>>> Jul 20 16:29:29.562826 module-2 crmd[21969]:   notice: Operation
>>>> ClusterMonitor_monitor_0: not running (node=module-2, call=41, rc=7,
>>>> cib-update=31, confirmed=true)
>>>> Jul 20 16:29:29.582162 module-2 crmd[21969]:   notice: Operation
>>>> ethmonitor_monitor_0: not running (node=module-2, call=31, rc=7,
>>>> cib-update=32, confirmed=true)
>>>> Jul 20 16:29:29.608400 module-2 crmd[21969]:    error:
>>>> pcmkRegisterNode: Triggered assert at xml.c:594 : node->type ==
>>>> XML_ELEMENT_NODE
>>>> Jul 20 16:29:29.608936 module-2 crmd[21969]:   notice: Operation
>>>> tomcat-instance_monitor_0: not running (node=module-2, call=36, rc=7,
>>>> cib-update=33, confirmed=true)
>>>> Jul 20 16:29:29.632033 module-2 crmd[21969]:   notice: Operation
>>>> SlaveIP_monitor_0: not running (node=module-2, call=21, rc=7,
>>>> cib-update=34, confirmed=true)
>>>> Jul 20 16:29:29.632687 module-2 crmd[21969]:   notice: Operation
>>>> N2F1_monitor_0: not running (node=module-2, call=49, rc=7,
>>>> cib-update=35, confirmed=true)
>>>> Jul 20 16:29:29.633311 module-2 crmd[21969]:   notice: Operation
>>>> MasterIP_monitor_0: not running (node=module-2, call=17, rc=7,
>>>> cib-update=36, confirmed=true)
>>>> Jul 20 16:29:29.635112 module-2 crmd[21969]:   notice: Operation
>>>> N1F1_monitor_0: not running (node=module-2, call=45, rc=7,
>>>> cib-update=37, confirmed=true)
>>>> Jul 20 16:29:29.663796 module-2 crmd[21969]:   notice: Operation
>>>> postgres_monitor_0: not running (node=module-2, call=26, rc=7,
>>>> cib-update=38, confirmed=true)
>>>> Jul 20 16:29:29.665570 module-2 crmd[21969]:   notice: Initiating
>>>> action 3: probe_complete probe_complete-module-2 on module-2 (local) -
>>>> no waiting
>>>> Jul 20 16:29:29.665725 module-2 crmd[21969]:   notice: Transition 0
>>>> (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>>>> Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Complete
>>>> Jul 20 16:29:29.675495 module-2 pengine[21968]:   notice: Relying on
>>>> watchdog integration for fencing
>>>> Jul 20 16:29:29.675536 module-2 pengine[21968]:   notice: We do not
>>>> have quorum - fencing and resource management disabled
>>>> Jul 20 16:29:29.678212 module-2 pengine[21968]:  warning: Node
>>>> module-1 is unclean!
>>>> Jul 20 16:29:29.678224 module-2 pengine[21968]:   notice: Cannot fence
>>>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>>>> ignore)
>>>> Jul 20 16:29:29.679220 module-2 pengine[21968]:   notice: Start
>>>> fence_sbd        (module-2 - blocked)
>>>> Jul 20 16:29:29.679236 module-2 pengine[21968]:   notice: Start
>>>> ipmi-1        (module-2 - blocked)
>>>> Jul 20 16:29:29.679251 module-2 pengine[21968]:   notice: Start
>>>> SlaveIP        (module-2 - blocked)
>>>> Jul 20 16:29:29.679267 module-2 pengine[21968]:   notice: Start
>>>> postgres:0        (module-2 - blocked)
>>>> Jul 20 16:29:29.679281 module-2 pengine[21968]:   notice: Start
>>>> ethmonitor:0        (module-2 - blocked)
>>>> Jul 20 16:29:29.679298 module-2 pengine[21968]:   notice: Start
>>>> tomcat-instance:0        (module-2 - blocked)
>>>> Jul 20 16:29:29.679308 module-2 pengine[21968]:   notice: Start
>>>> ClusterMonitor:0        (module-2 - blocked)
>>>> Jul 20 16:29:29.680189 module-2 pengine[21968]:  warning: Calculated
>>>> Transition 1: /var/lib/pacemaker/pengine/pe-warn-1.bz2
>>>> Jul 20 16:29:29.713932 module-2 pengine[21968]:   notice: Relying on
>>>> watchdog integration for fencing
>>>> Jul 20 16:29:29.713975 module-2 pengine[21968]:   notice: We do not
>>>> have quorum - fencing and resource management disabled
>>>> Jul 20 16:29:29.716511 module-2 pengine[21968]:  warning: Node
>>>> module-1 is unclean!
>>>> Jul 20 16:29:29.716524 module-2 pengine[21968]:   notice: Cannot fence
>>>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>>>> ignore)
>>>> Jul 20 16:29:29.717480 module-2 pengine[21968]:   notice: Start
>>>> fence_sbd        (module-2 - blocked)
>>>> Jul 20 16:29:29.717496 module-2 pengine[21968]:   notice: Start
>>>> ipmi-1        (module-2 - blocked)
>>>> Jul 20 16:29:29.717513 module-2 pengine[21968]:   notice: Start
>>>> SlaveIP        (module-2 - blocked)
>>>> Jul 20 16:29:29.717527 module-2 pengine[21968]:   notice: Start
>>>> postgres:0        (module-2 - blocked)
>>>> Jul 20 16:29:29.717542 module-2 pengine[21968]:   notice: Start
>>>> ethmonitor:0        (module-2 - blocked)
>>>> Jul 20 16:29:29.717560 module-2 pengine[21968]:   notice: Start
>>>> tomcat-instance:0        (module-2 - blocked)
>>>> Jul 20 16:29:29.717570 module-2 pengine[21968]:   notice: Start
>>>> ClusterMonitor:0        (module-2 - blocked)
>>>> Jul 20 16:29:29.718349 module-2 pengine[21968]:  warning: Calculated
>>>> Transition 2: /var/lib/pacemaker/pengine/pe-warn-2.bz2
>>>> Jul 20 16:29:29.718377 module-2 crmd[21969]:   notice: Transition 2
>>>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>>>> Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Complete
>>>> Jul 20 16:29:29.718407 module-2 crmd[21969]:   notice: State
>>>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>>>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>>> Jul 20 16:29:31.509099 module-2 corosync[7632]:   [TOTEM ] A new
>>>> membership (10.1.1.1:52) was formed. Members joined: 1
>>>> Jul 20 16:29:31.511730 module-2 corosync[7632]:   [QUORUM] This node
>>>> is within the primary component and will provide service.
>>>> Jul 20 16:29:31.511745 module-2 corosync[7632]:   [QUORUM] Members[2]: 1 2
>>>> Jul 20 16:29:31.511752 module-2 corosync[7632]:   [MAIN  ] Completed
>>>> service synchronization, ready to provide service.
>>>> Jul 20 16:29:31.511775 module-2 crmd[21969]:   notice: Membership 52:
>>>> quorum acquired (2)
>>>> Jul 20 16:29:31.511776 module-2 pacemakerd[21942]:   notice:
>>>> Membership 52: quorum acquired (2)
>>>> Jul 20 16:29:31.520800 module-2 crmd[21969]:   notice: Could not
>>>> obtain a node name for corosync nodeid 1
>>>> Jul 20 16:29:31.521734 module-2 pacemakerd[21942]:   notice: Could not
>>>> obtain a node name for corosync nodeid 1
>>>> Jul 20 16:29:31.539205 module-2 crmd[21969]:   notice: Could not
>>>> obtain a node name for corosync nodeid 1
>>>> Jul 20 16:29:31.539224 module-2 crmd[21969]:   notice:
>>>> pcmk_quorum_notification: Node (null)[1] - state is now member (was
>>>> (null))
>>>
>>> It seems that pacemaker does not know which node came up. The same
>>> error appeared for the local node, BTW.
>>
>> If you are referring to the "Could not obtain a node name" messages,
>> those seem to be normal on this system and occur during a normal
>> system start up. Or was there something else you were referring to?
>
> I believe this is controlled by whether node names are explicitly listed
> in corosync.conf (ring0_addr etc.). It should work either way.
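
For reference, a nodelist that names the nodes explicitly would
presumably look something like the following (the names and addresses
below are placeholders, not our actual config):

    nodelist {
        node {
            ring0_addr: 10.1.1.1
            name: module-1
            nodeid: 1
        }
        node {
            ring0_addr: 10.1.1.2
            name: module-2
            nodeid: 2
        }
    }
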
>
>>>> Jul 20 16:29:31.540006 module-2 pacemakerd[21942]:   notice: Could not
>>>> obtain a node name for corosync nodeid 1
>>>> Jul 20 16:29:31.540026 module-2 pacemakerd[21942]:   notice:
>>>> pcmk_quorum_notification: Node (null)[1] - state is now member (was
>>>> (null))
>>>> Jul 20 16:29:36.578489 module-2 crmd[21969]:   notice: State
>>>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>>>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>>> Jul 20 16:29:37.583970 module-2 crmd[21969]:   notice: Notifications disabled
>>>> Jul 20 16:29:37.584003 module-2 crmd[21969]:   notice: Watchdog
>>>> enabled but stonith-watchdog-timeout is disabled
>>>> Jul 20 16:29:37.594061 module-2 pengine[21968]:   notice: Relying on
>>>> watchdog integration for fencing
>>>> Jul 20 16:29:37.596990 module-2 pengine[21968]:  warning: Scheduling
>>>> Node module-1 for STONITH
>>>>
>>>> I don't think this is common, since I have never seen it before and
>>>> have performed a lot of system boots where both nodes were down.
>>>>
>>>> Thanks,
>>>> -nate
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



