[ClusterLabs] Resources not always stopped when quorum lost
Matt Rideout
mrideout at windserve.com
Thu May 28 18:22:35 UTC 2015
It turns out that if I wait, a node that already has resources running
when quorum is lost does eventually stop them, but only after 15
minutes. I repeated the test and saw the same 15-minute delay.
cluster-recheck-interval is set to 15 minutes by default, so I dropped
it to 1 minute with:
pcs property set cluster-recheck-interval="60"
This successfully reduced the delay to 1 minute.
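For anyone who wants to verify this on their own cluster, the current
value should be visible with:

pcs property show cluster-recheck-interval

("pcs property list --all" also shows the defaults; the exact
subcommand may vary between pcs versions.)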
Is it normal for Pacemaker to wait for cluster-recheck-interval before
shutting down resources that were already running at the time quorum was
lost?
Thanks,
Matt
On 5/28/15 11:39 AM, Matt Rideout wrote:
> I'm attempting to upgrade a two-node cluster with no quorum
> requirement to a three-node cluster with a two-member quorum
> requirement. Each node is running CentOS 7, Pacemaker 1.1.12-22 and
> Corosync 2.3.4-4.
>
> If a node that's running resources loses quorum, then I want it to
> stop all of its resources. The goal was partially accomplished by
> setting the following in corosync.conf:
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
>
> ...and updating Pacemaker's configuration with:
>
> pcs property set no-quorum-policy=stop
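>
> As a sanity check, I believe the effective quorum state can be
> confirmed on each node with corosync-quorumtool -s, and the Pacemaker
> side with:
>
> pcs property show no-quorum-policy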
>
> With the above configuration, two failure scenarios work as I would
> expect:
>
> 1. If I power up a single node, it sees that there is no quorum, and
> refuses to start any resources until it sees a second node come up.
>
> 2. If there are two nodes running, and I power down a node that's
> running resources, the other node sees that it lost quorum, and
> refuses to start any resources.
>
> However, a third failure scenario does not work as I would expect:
>
> 3. If there are two nodes running, and I power down a node that's not
> running resources, the node that is running resources notes in its log
> that it lost quorum, but does not actually shut down any of its running
> services.
>
> Any ideas on what the problem may be would be greatly appreciated. In
> case it helps, I've included the output of "pcs status" and "pcs config
> show", the contents of corosync.conf, and the Pacemaker and corosync
> logs from the period during which resources were not stopped.
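>
> I can also provide the output of "crm_simulate -sL" (which, as I
> understand it, shows what the policy engine would do with the live
> CIB) if that would help.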
>
> *"pcs status" shows the resources still running after quorum is lost:*
>
> Cluster name:
> Last updated: Thu May 28 10:27:47 2015
> Last change: Thu May 28 10:03:05 2015
> Stack: corosync
> Current DC: node1 (1) - partition WITHOUT quorum
> Version: 1.1.12-a14efad
> 3 Nodes configured
> 12 Resources configured
>
>
> Node node3 (3): OFFLINE (standby)
> Online: [ node1 ]
> OFFLINE: [ node2 ]
>
> Full list of resources:
>
> Resource Group: primary
>     virtual_ip_primary (ocf::heartbeat:IPaddr2): Started node1
>     GreenArrowFS (ocf::heartbeat:Filesystem): Started node1
>     GreenArrow (ocf::drh:greenarrow): Started node1
>     virtual_ip_1 (ocf::heartbeat:IPaddr2): Started node1
>     virtual_ip_2 (ocf::heartbeat:IPaddr2): Started node1
> Resource Group: secondary
>     virtual_ip_secondary (ocf::heartbeat:IPaddr2): Stopped
>     GreenArrow-Secondary (ocf::drh:greenarrow-secondary): Stopped
> Clone Set: ping-clone [ping]
>     Started: [ node1 ]
>     Stopped: [ node2 node3 ]
> Master/Slave Set: GreenArrowDataClone [GreenArrowData]
>     Masters: [ node1 ]
>     Stopped: [ node2 node3 ]
>
> PCSD Status:
> node1: Online
> node2: Offline
> node3: Offline
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> *"pcs config show" shows that the "no-quorum-policy: stop" setting is in place:*
>
> Cluster Name:
> Corosync Nodes:
> node1 node2 node3
> Pacemaker Nodes:
> node1 node2 node3
>
> Resources:
> Group: primary
>   Resource: virtual_ip_primary (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=10.10.10.1 cidr_netmask=32
>    Operations: start interval=0s timeout=20s (virtual_ip_primary-start-timeout-20s)
>                stop interval=0s timeout=20s (virtual_ip_primary-stop-timeout-20s)
>                monitor interval=30s (virtual_ip_primary-monitor-interval-30s)
>   Resource: GreenArrowFS (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/drbd1 directory=/media/drbd1 fstype=xfs options=noatime,discard
>    Operations: start interval=0s timeout=60 (GreenArrowFS-start-timeout-60)
>                stop interval=0s timeout=60 (GreenArrowFS-stop-timeout-60)
>                monitor interval=20 timeout=40 (GreenArrowFS-monitor-interval-20)
>   Resource: GreenArrow (class=ocf provider=drh type=greenarrow)
>    Operations: start interval=0s timeout=30 (GreenArrow-start-timeout-30)
>                stop interval=0s timeout=240 (GreenArrow-stop-timeout-240)
>                monitor interval=10 timeout=20 (GreenArrow-monitor-interval-10)
>   Resource: virtual_ip_1 (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=64.21.76.51 cidr_netmask=32
>    Operations: start interval=0s timeout=20s (virtual_ip_1-start-timeout-20s)
>                stop interval=0s timeout=20s (virtual_ip_1-stop-timeout-20s)
>                monitor interval=30s (virtual_ip_1-monitor-interval-30s)
>   Resource: virtual_ip_2 (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=64.21.76.63 cidr_netmask=32
>    Operations: start interval=0s timeout=20s (virtual_ip_2-start-timeout-20s)
>                stop interval=0s timeout=20s (virtual_ip_2-stop-timeout-20s)
>                monitor interval=30s (virtual_ip_2-monitor-interval-30s)
> Group: secondary
>   Resource: virtual_ip_secondary (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=10.10.10.4 cidr_netmask=32
>    Operations: start interval=0s timeout=20s (virtual_ip_secondary-start-timeout-20s)
>                stop interval=0s timeout=20s (virtual_ip_secondary-stop-timeout-20s)
>                monitor interval=30s (virtual_ip_secondary-monitor-interval-30s)
>   Resource: GreenArrow-Secondary (class=ocf provider=drh type=greenarrow-secondary)
>    Operations: start interval=0s timeout=30 (GreenArrow-Secondary-start-timeout-30)
>                stop interval=0s timeout=240 (GreenArrow-Secondary-stop-timeout-240)
>                monitor interval=10 timeout=20 (GreenArrow-Secondary-monitor-interval-10)
> Clone: ping-clone
>  Resource: ping (class=ocf provider=pacemaker type=ping)
>   Attributes: dampen=30s multiplier=1000 host_list=64.21.76.1
>   Operations: start interval=0s timeout=60 (ping-start-timeout-60)
>               stop interval=0s timeout=20 (ping-stop-timeout-20)
>               monitor interval=10 timeout=60 (ping-monitor-interval-10)
> Master: GreenArrowDataClone
>  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>  Resource: GreenArrowData (class=ocf provider=linbit type=drbd)
>   Attributes: drbd_resource=r0
>   Operations: start interval=0s timeout=240 (GreenArrowData-start-timeout-240)
>               promote interval=0s timeout=90 (GreenArrowData-promote-timeout-90)
>               demote interval=0s timeout=90 (GreenArrowData-demote-timeout-90)
>               stop interval=0s timeout=100 (GreenArrowData-stop-timeout-100)
>               monitor interval=60s (GreenArrowData-monitor-interval-60s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
>   Resource: primary
>     Enabled on: node1 (score:INFINITY) (id:location-primary-node1-INFINITY)
>     Constraint: location-primary
>       Rule: score=-INFINITY boolean-op=or (id:location-primary-rule)
>         Expression: pingd lt 1 (id:location-primary-rule-expr)
>         Expression: not_defined pingd (id:location-primary-rule-expr-1)
> Ordering Constraints:
>   promote GreenArrowDataClone then start GreenArrowFS (kind:Mandatory) (id:order-GreenArrowDataClone-GreenArrowFS-mandatory)
>   stop GreenArrowFS then demote GreenArrowDataClone (kind:Mandatory) (id:order-GreenArrowFS-GreenArrowDataClone-mandatory)
> Colocation Constraints:
>   GreenArrowFS with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-GreenArrowFS-GreenArrowDataClone-INFINITY)
>   virtual_ip_secondary with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Slave) (id:colocation-virtual_ip_secondary-GreenArrowDataClone-INFINITY)
>   virtual_ip_primary with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-virtual_ip_primary-GreenArrowDataClone-INFINITY)
>
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: cluster_greenarrow
> dc-version: 1.1.12-a14efad
> have-watchdog: false
> no-quorum-policy: stop
> stonith-enabled: false
> Node Attributes:
> node3: standby=on
>
> *Here's what was logged*:
>
> May 28 10:19:51 node1 pengine[1296]: notice: stage6: Scheduling Node node3 for shutdown
> May 28 10:19:51 node1 pengine[1296]: notice: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-992.bz2
> May 28 10:19:51 node1 crmd[1297]: notice: run_graph: Transition 7 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-992.bz2): Complete
> May 28 10:19:51 node1 crmd[1297]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback: do_shutdown of node3 (op 64) is complete
> May 28 10:19:51 node1 attrd[1295]: notice: crm_update_peer_state: attrd_peer_change_cb: Node node3[3] - state is now lost (was member)
> May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_remove: Removing all node3 attributes for attrd_peer_change_cb
> May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_change_cb: Lost attribute writer node3
> May 28 10:19:51 node1 corosync[1040]: [TOTEM ] Membership left list contains incorrect address. This is sign of misconfiguration between nodes!
> May 28 10:19:51 node1 corosync[1040]: [TOTEM ] A new membership (64.21.76.61:25740) was formed. Members left: 3
> May 28 10:19:51 node1 corosync[1040]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> May 28 10:19:51 node1 corosync[1040]: [QUORUM] Members[1]: 1
> May 28 10:19:51 node1 corosync[1040]: [MAIN ] Completed service synchronization, ready to provide service.
> May 28 10:19:51 node1 crmd[1297]: notice: pcmk_quorum_notification: Membership 25740: quorum lost (1)
> May 28 10:19:51 node1 crmd[1297]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
> May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback: do_shutdown of node3 (op 64) is complete
> May 28 10:19:51 node1 pacemakerd[1254]: notice: pcmk_quorum_notification: Membership 25740: quorum lost (1)
> May 28 10:19:51 node1 pacemakerd[1254]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
> May 28 10:19:52 node1 corosync[1040]: [TOTEM ] Automatically recovered ring 1
>
> *Here's corosync.conf:*
>
> totem {
>     version: 2
>     secauth: off
>     cluster_name: cluster_greenarrow
>     rrp_mode: passive
>     transport: udpu
> }
>
> nodelist {
>     node {
>         ring0_addr: node1
>         ring1_addr: 10.10.10.2
>         nodeid: 1
>     }
>     node {
>         ring0_addr: node2
>         ring1_addr: 10.10.10.3
>         nodeid: 2
>     }
>     node {
>         ring0_addr: node3
>         nodeid: 3
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 0
> }
>
> logging {
>     to_syslog: yes
> }
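>
> (One thing I was unsure about while gathering this: node3 has no
> ring1_addr even though rrp_mode is passive, which may be what the
> TOTEM "misconfiguration" warning in the logs above refers to. Ring
> status on each node can be checked with corosync-cfgtool -s.)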
>
> Thanks,
>
> Matt