[ClusterLabs] Previous DC fenced prior to integration

Nate Clark nate at neworld.us
Sat Jul 23 23:14:14 EDT 2016


On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 23.07.2016 01:37, Nate Clark пишет:
>> Hello,
>>
>> I am running pacemaker 1.1.13 with corosync and think I may have
>> encountered a start up timing issue on a two node cluster. I didn't
>> notice anything in the changelog for 14 or 15 that looked similar to
>> this or open bugs.
>>
>> The rough out line of what happened:
>>
>> Module 1 and 2 running
>> Module 1 is DC
>> Module 2 shuts down
>> Module 1 updates node attributes used by resources
>> Module 1 shuts down
>> Module 2 starts up
>> Module 2 votes itself as DC
>> Module 1 starts up
>> Module 2 sees module 1 in corosync and notices it has quorum
>> Module 2 enters policy engine state.
>> Module 2 policy engine decides to fence 1
>> Module 2 then continues and starts resource on itself based upon the old state
>>
>> For some reason the integration never occurred and module 2 starts to
>> perform actions based on stale state.
>>
>> Here is the full logs
>> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
>> cluster infrastructure: corosync
>> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
>> obtain a node name for corosync nodeid 2
>> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
>> for stonith topology changes
>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
>> 'watchdog' to the device list (1 active devices)
>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
>> on watchdog integration for fencing
>> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
>> pcmk_quorum_notification: Node module-2[2] - state is now member (was
>> (null))
>> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications disabled
>> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
>> enabled but stonith-watchdog-timeout is disabled
>> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
>> is operational
>> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
>> transition S_STARTING -> S_PENDING [ input=I_PENDING
>> cause=C_FSA_INTERNAL origin=do_started ]
>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
>> 'fence_sbd' to the device list (2 active devices)
>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
>> 'ipmi-1' to the device list (3 active devices)
>> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
>> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>> cause=C_TIMER_POPPED origin=election_timeout_popped ]
>> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
>> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications disabled
>> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
>> enabled but stonith-watchdog-timeout is disabled
>> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
>> watchdog integration for fencing
>> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
>> have quorum - fencing and resource management disabled
>> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
>> module-1 is unclean!
>> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>> ignore)
>> Jul 20 16:29:27.503521 module-2 pengine[21968]:   notice: Start
>> fence_sbd        (module-2 - blocked)
>> Jul 20 16:29:27.503539 module-2 pengine[21968]:   notice: Start
>> ipmi-1        (module-2 - blocked)
>> Jul 20 16:29:27.503559 module-2 pengine[21968]:   notice: Start
>> SlaveIP        (module-2 - blocked)
>> Jul 20 16:29:27.503582 module-2 pengine[21968]:   notice: Start
>> postgres:0        (module-2 - blocked)
>> Jul 20 16:29:27.503597 module-2 pengine[21968]:   notice: Start
>> ethmonitor:0        (module-2 - blocked)
>> Jul 20 16:29:27.503618 module-2 pengine[21968]:   notice: Start
>> tomcat-instance:0        (module-2 - blocked)
>> Jul 20 16:29:27.503629 module-2 pengine[21968]:   notice: Start
>> ClusterMonitor:0        (module-2 - blocked)
>> Jul 20 16:29:27.506945 module-2 pengine[21968]:  warning: Calculated
>> Transition 0: /var/lib/pacemaker/pengine/pe-warn-0.bz2
>> Jul 20 16:29:27.507976 module-2 crmd[21969]:   notice: Initiating
>> action 4: monitor fence_sbd_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.509282 module-2 crmd[21969]:   notice: Initiating
>> action 5: monitor ipmi-1_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.511839 module-2 crmd[21969]:   notice: Initiating
>> action 6: monitor ipmi-2_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.512629 module-2 crmd[21969]:   notice: Initiating
>> action 7: monitor MasterIP_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.513340 module-2 crmd[21969]:   notice: Initiating
>> action 8: monitor SlaveIP_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.531174 module-2 crmd[21969]:   notice: Initiating
>> action 9: monitor postgres:0_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.532708 module-2 crmd[21969]:   notice: Initiating
>> action 10: monitor ethmonitor:0_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.534246 module-2 crmd[21969]:   notice: Initiating
>> action 11: monitor tomcat-instance:0_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.535713 module-2 crmd[21969]:   notice: Initiating
>> action 12: monitor ClusterMonitor:0_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.537077 module-2 crmd[21969]:   notice: Initiating
>> action 13: monitor N1F1_monitor_0 on module-2 (local)
>> Jul 20 16:29:27.538388 module-2 crmd[21969]:   notice: Initiating
>> action 14: monitor N2F1_monitor_0 on module-2 (local)
>> Jul 20 16:29:28.541535 module-2 crmd[21969]:   notice: Operation
>> fence_sbd_monitor_0: not running (node=module-2, call=5, rc=7,
>> cib-update=28, confirmed=true)
>> Jul 20 16:29:29.543752 module-2 crmd[21969]:   notice: Operation
>> ipmi-1_monitor_0: not running (node=module-2, call=9, rc=7,
>> cib-update=29, confirmed=true)
>> Jul 20 16:29:29.544399 module-2 crmd[21969]:   notice: Operation
>> ipmi-2_monitor_0: not running (node=module-2, call=13, rc=7,
>> cib-update=30, confirmed=true)
>> Jul 20 16:29:29.544723 module-2 crmd[21969]:   notice: Transition
>> aborted by status-2-master-postgres, master-postgres=-INFINITY:
>> Transient attribute change (create cib=1.49.6,
>> source=abort_unless_down:319,
>> path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2'],
>> 0)
>> Jul 20 16:29:29.562826 module-2 crmd[21969]:   notice: Operation
>> ClusterMonitor_monitor_0: not running (node=module-2, call=41, rc=7,
>> cib-update=31, confirmed=true)
>> Jul 20 16:29:29.582162 module-2 crmd[21969]:   notice: Operation
>> ethmonitor_monitor_0: not running (node=module-2, call=31, rc=7,
>> cib-update=32, confirmed=true)
>> Jul 20 16:29:29.608400 module-2 crmd[21969]:    error:
>> pcmkRegisterNode: Triggered assert at xml.c:594 : node->type ==
>> XML_ELEMENT_NODE
>> Jul 20 16:29:29.608936 module-2 crmd[21969]:   notice: Operation
>> tomcat-instance_monitor_0: not running (node=module-2, call=36, rc=7,
>> cib-update=33, confirmed=true)
>> Jul 20 16:29:29.632033 module-2 crmd[21969]:   notice: Operation
>> SlaveIP_monitor_0: not running (node=module-2, call=21, rc=7,
>> cib-update=34, confirmed=true)
>> Jul 20 16:29:29.632687 module-2 crmd[21969]:   notice: Operation
>> N2F1_monitor_0: not running (node=module-2, call=49, rc=7,
>> cib-update=35, confirmed=true)
>> Jul 20 16:29:29.633311 module-2 crmd[21969]:   notice: Operation
>> MasterIP_monitor_0: not running (node=module-2, call=17, rc=7,
>> cib-update=36, confirmed=true)
>> Jul 20 16:29:29.635112 module-2 crmd[21969]:   notice: Operation
>> N1F1_monitor_0: not running (node=module-2, call=45, rc=7,
>> cib-update=37, confirmed=true)
>> Jul 20 16:29:29.663796 module-2 crmd[21969]:   notice: Operation
>> postgres_monitor_0: not running (node=module-2, call=26, rc=7,
>> cib-update=38, confirmed=true)
>> Jul 20 16:29:29.665570 module-2 crmd[21969]:   notice: Initiating
>> action 3: probe_complete probe_complete-module-2 on module-2 (local) -
>> no waiting
>> Jul 20 16:29:29.665725 module-2 crmd[21969]:   notice: Transition 0
>> (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Complete
>> Jul 20 16:29:29.675495 module-2 pengine[21968]:   notice: Relying on
>> watchdog integration for fencing
>> Jul 20 16:29:29.675536 module-2 pengine[21968]:   notice: We do not
>> have quorum - fencing and resource management disabled
>> Jul 20 16:29:29.678212 module-2 pengine[21968]:  warning: Node
>> module-1 is unclean!
>> Jul 20 16:29:29.678224 module-2 pengine[21968]:   notice: Cannot fence
>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>> ignore)
>> Jul 20 16:29:29.679220 module-2 pengine[21968]:   notice: Start
>> fence_sbd        (module-2 - blocked)
>> Jul 20 16:29:29.679236 module-2 pengine[21968]:   notice: Start
>> ipmi-1        (module-2 - blocked)
>> Jul 20 16:29:29.679251 module-2 pengine[21968]:   notice: Start
>> SlaveIP        (module-2 - blocked)
>> Jul 20 16:29:29.679267 module-2 pengine[21968]:   notice: Start
>> postgres:0        (module-2 - blocked)
>> Jul 20 16:29:29.679281 module-2 pengine[21968]:   notice: Start
>> ethmonitor:0        (module-2 - blocked)
>> Jul 20 16:29:29.679298 module-2 pengine[21968]:   notice: Start
>> tomcat-instance:0        (module-2 - blocked)
>> Jul 20 16:29:29.679308 module-2 pengine[21968]:   notice: Start
>> ClusterMonitor:0        (module-2 - blocked)
>> Jul 20 16:29:29.680189 module-2 pengine[21968]:  warning: Calculated
>> Transition 1: /var/lib/pacemaker/pengine/pe-warn-1.bz2
>> Jul 20 16:29:29.713932 module-2 pengine[21968]:   notice: Relying on
>> watchdog integration for fencing
>> Jul 20 16:29:29.713975 module-2 pengine[21968]:   notice: We do not
>> have quorum - fencing and resource management disabled
>> Jul 20 16:29:29.716511 module-2 pengine[21968]:  warning: Node
>> module-1 is unclean!
>> Jul 20 16:29:29.716524 module-2 pengine[21968]:   notice: Cannot fence
>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>> ignore)
>> Jul 20 16:29:29.717480 module-2 pengine[21968]:   notice: Start
>> fence_sbd        (module-2 - blocked)
>> Jul 20 16:29:29.717496 module-2 pengine[21968]:   notice: Start
>> ipmi-1        (module-2 - blocked)
>> Jul 20 16:29:29.717513 module-2 pengine[21968]:   notice: Start
>> SlaveIP        (module-2 - blocked)
>> Jul 20 16:29:29.717527 module-2 pengine[21968]:   notice: Start
>> postgres:0        (module-2 - blocked)
>> Jul 20 16:29:29.717542 module-2 pengine[21968]:   notice: Start
>> ethmonitor:0        (module-2 - blocked)
>> Jul 20 16:29:29.717560 module-2 pengine[21968]:   notice: Start
>> tomcat-instance:0        (module-2 - blocked)
>> Jul 20 16:29:29.717570 module-2 pengine[21968]:   notice: Start
>> ClusterMonitor:0        (module-2 - blocked)
>> Jul 20 16:29:29.718349 module-2 pengine[21968]:  warning: Calculated
>> Transition 2: /var/lib/pacemaker/pengine/pe-warn-2.bz2
>> Jul 20 16:29:29.718377 module-2 crmd[21969]:   notice: Transition 2
>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Complete
>> Jul 20 16:29:29.718407 module-2 crmd[21969]:   notice: State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Jul 20 16:29:31.509099 module-2 corosync[7632]:   [TOTEM ] A new
>> membership (10.1.1.1:52) was formed. Members joined: 1
>> Jul 20 16:29:31.511730 module-2 corosync[7632]:   [QUORUM] This node
>> is within the primary component and will provide service.
>> Jul 20 16:29:31.511745 module-2 corosync[7632]:   [QUORUM] Members[2]: 1 2
>> Jul 20 16:29:31.511752 module-2 corosync[7632]:   [MAIN  ] Completed
>> service synchronization, ready to provide service.
>> Jul 20 16:29:31.511775 module-2 crmd[21969]:   notice: Membership 52:
>> quorum acquired (2)
>> Jul 20 16:29:31.511776 module-2 pacemakerd[21942]:   notice:
>> Membership 52: quorum acquired (2)
>> Jul 20 16:29:31.520800 module-2 crmd[21969]:   notice: Could not
>> obtain a node name for corosync nodeid 1
>> Jul 20 16:29:31.521734 module-2 pacemakerd[21942]:   notice: Could not
>> obtain a node name for corosync nodeid 1
>> Jul 20 16:29:31.539205 module-2 crmd[21969]:   notice: Could not
>> obtain a node name for corosync nodeid 1
>> Jul 20 16:29:31.539224 module-2 crmd[21969]:   notice:
>> pcmk_quorum_notification: Node (null)[1] - state is now member (was
>> (null))
>
> It seems that pacemaker does not know what node came up. Same error was
> for a local node BTW.

If you are referring to the message "Could not obtain a node name"
those seem to be normal on this system and occur on normal system
start up, or was there something else you were referring to?

>> Jul 20 16:29:31.540006 module-2 pacemakerd[21942]:   notice: Could not
>> obtain a node name for corosync nodeid 1
>> Jul 20 16:29:31.540026 module-2 pacemakerd[21942]:   notice:
>> pcmk_quorum_notification: Node (null)[1] - state is now member (was
>> (null))
>> Jul 20 16:29:36.578489 module-2 crmd[21969]:   notice: State
>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Jul 20 16:29:37.583970 module-2 crmd[21969]:   notice: Notifications disabled
>> Jul 20 16:29:37.584003 module-2 crmd[21969]:   notice: Watchdog
>> enabled but stonith-watchdog-timeout is disabled
>> Jul 20 16:29:37.594061 module-2 pengine[21968]:   notice: Relying on
>> watchdog integration for fencing
>> Jul 20 16:29:37.596990 module-2 pengine[21968]:  warning: Scheduling
>> Node module-1 for STONITH
>>
>> I don't think this is common since I have never seen it before and
>> have performed a lot of system boots where both nodes were down.
>>
>> Thanks,
>> -nate
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




More information about the Users mailing list