[ClusterLabs] Resources not always stopped when quorum lost
Matt Rideout
mrideout at windserve.com
Thu May 28 15:39:27 UTC 2015
I'm attempting to upgrade a two-node cluster with no quorum requirement
to a three-node cluster with a two-member quorum requirement. Each node
is running CentOS 7, Pacemaker 1.1.12-22, and Corosync 2.3.4-4.
If a node that's running resources loses quorum, I want it to stop
all of its resources. This goal was partially accomplished by setting
the following in corosync.conf:
quorum {
    provider: corosync_votequorum
    two_node: 1
}
...and updating Pacemaker's configuration with:
pcs property set no-quorum-policy=stop
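For what it's worth, both settings can be double-checked after the fact.
The exact commands below are a sketch from memory:

# Should print "no-quorum-policy: stop"
pcs property show no-quorum-policy

# The "Flags:" line should include "2Node" whenever two_node is in effect
corosync-quorumtool -s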
With the above configuration, two failure scenarios work as I would expect:
1. If I power up a single node, it sees that there is no quorum, and
refuses to start any resources until it sees a second node come up.
2. If there are two nodes running, and I power down a node that's
running resources, the other node sees that it lost quorum, and refuses
to start any resources.
However, a third failure scenario does not work as I would expect:
3. If there are two nodes running, and I power down a node that's not
running resources, the node that is running resources notes in its log
that it lost quorum, but does not actually shut down any of its running
resources. (One way to watch this happen is sketched just below.)
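A quick way to watch the quorum transition while reproducing scenario 3,
assuming the stock corosync tools are available:

# One-shot view of membership, votes, and quorum state
corosync-quorumtool -s

# Or monitor continuously while powering the other node down
corosync-quorumtool -m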
Any ideas on what the problem may be would be greatly appreciated. In
case it helps, I've included the output of "pcs status" and "pcs config
show", the contents of corosync.conf, and the Pacemaker and Corosync logs
from the period during which resources were not stopped.
*"pcs status" shows the resources still running after quorum is lost:*
Cluster name:
Last updated: Thu May 28 10:27:47 2015
Last change: Thu May 28 10:03:05 2015
Stack: corosync
Current DC: node1 (1) - partition WITHOUT quorum
Version: 1.1.12-a14efad
3 Nodes configured
12 Resources configured
Node node3 (3): OFFLINE (standby)
Online: [ node1 ]
OFFLINE: [ node2 ]
Full list of resources:
 Resource Group: primary
     virtual_ip_primary (ocf::heartbeat:IPaddr2): Started node1
     GreenArrowFS (ocf::heartbeat:Filesystem): Started node1
     GreenArrow (ocf::drh:greenarrow): Started node1
     virtual_ip_1 (ocf::heartbeat:IPaddr2): Started node1
     virtual_ip_2 (ocf::heartbeat:IPaddr2): Started node1
 Resource Group: secondary
     virtual_ip_secondary (ocf::heartbeat:IPaddr2): Stopped
     GreenArrow-Secondary (ocf::drh:greenarrow-secondary): Stopped
 Clone Set: ping-clone [ping]
     Started: [ node1 ]
     Stopped: [ node2 node3 ]
 Master/Slave Set: GreenArrowDataClone [GreenArrowData]
     Masters: [ node1 ]
     Stopped: [ node2 node3 ]
PCSD Status:
  node1: Online
  node2: Offline
  node3: Offline
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
*"pcs config show"**shows that the "no-quorum-policy: stop" setting is
in place:*
Cluster Name:
Corosync Nodes:
node1 node2 node3
Pacemaker Nodes:
node1 node2 node3
Resources:
 Group: primary
  Resource: virtual_ip_primary (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.10.10.1 cidr_netmask=32
   Operations: start interval=0s timeout=20s (virtual_ip_primary-start-timeout-20s)
               stop interval=0s timeout=20s (virtual_ip_primary-stop-timeout-20s)
               monitor interval=30s (virtual_ip_primary-monitor-interval-30s)
  Resource: GreenArrowFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/media/drbd1 fstype=xfs options=noatime,discard
   Operations: start interval=0s timeout=60 (GreenArrowFS-start-timeout-60)
               stop interval=0s timeout=60 (GreenArrowFS-stop-timeout-60)
               monitor interval=20 timeout=40 (GreenArrowFS-monitor-interval-20)
  Resource: GreenArrow (class=ocf provider=drh type=greenarrow)
   Operations: start interval=0s timeout=30 (GreenArrow-start-timeout-30)
               stop interval=0s timeout=240 (GreenArrow-stop-timeout-240)
               monitor interval=10 timeout=20 (GreenArrow-monitor-interval-10)
  Resource: virtual_ip_1 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=64.21.76.51 cidr_netmask=32
   Operations: start interval=0s timeout=20s (virtual_ip_1-start-timeout-20s)
               stop interval=0s timeout=20s (virtual_ip_1-stop-timeout-20s)
               monitor interval=30s (virtual_ip_1-monitor-interval-30s)
  Resource: virtual_ip_2 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=64.21.76.63 cidr_netmask=32
   Operations: start interval=0s timeout=20s (virtual_ip_2-start-timeout-20s)
               stop interval=0s timeout=20s (virtual_ip_2-stop-timeout-20s)
               monitor interval=30s (virtual_ip_2-monitor-interval-30s)
 Group: secondary
  Resource: virtual_ip_secondary (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.10.10.4 cidr_netmask=32
   Operations: start interval=0s timeout=20s (virtual_ip_secondary-start-timeout-20s)
               stop interval=0s timeout=20s (virtual_ip_secondary-stop-timeout-20s)
               monitor interval=30s (virtual_ip_secondary-monitor-interval-30s)
  Resource: GreenArrow-Secondary (class=ocf provider=drh type=greenarrow-secondary)
   Operations: start interval=0s timeout=30 (GreenArrow-Secondary-start-timeout-30)
               stop interval=0s timeout=240 (GreenArrow-Secondary-stop-timeout-240)
               monitor interval=10 timeout=20 (GreenArrow-Secondary-monitor-interval-10)
 Clone: ping-clone
  Resource: ping (class=ocf provider=pacemaker type=ping)
   Attributes: dampen=30s multiplier=1000 host_list=64.21.76.1
   Operations: start interval=0s timeout=60 (ping-start-timeout-60)
               stop interval=0s timeout=20 (ping-stop-timeout-20)
               monitor interval=10 timeout=60 (ping-monitor-interval-10)
 Master: GreenArrowDataClone
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: GreenArrowData (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=r0
   Operations: start interval=0s timeout=240 (GreenArrowData-start-timeout-240)
               promote interval=0s timeout=90 (GreenArrowData-promote-timeout-90)
               demote interval=0s timeout=90 (GreenArrowData-demote-timeout-90)
               stop interval=0s timeout=100 (GreenArrowData-stop-timeout-100)
               monitor interval=60s (GreenArrowData-monitor-interval-60s)
Stonith Devices:
Fencing Levels:
Location Constraints:
  Resource: primary
    Enabled on: node1 (score:INFINITY) (id:location-primary-node1-INFINITY)
    Constraint: location-primary
      Rule: score=-INFINITY boolean-op=or (id:location-primary-rule)
        Expression: pingd lt 1 (id:location-primary-rule-expr)
        Expression: not_defined pingd (id:location-primary-rule-expr-1)
Ordering Constraints:
  promote GreenArrowDataClone then start GreenArrowFS (kind:Mandatory) (id:order-GreenArrowDataClone-GreenArrowFS-mandatory)
  stop GreenArrowFS then demote GreenArrowDataClone (kind:Mandatory) (id:order-GreenArrowFS-GreenArrowDataClone-mandatory)
Colocation Constraints:
  GreenArrowFS with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-GreenArrowFS-GreenArrowDataClone-INFINITY)
  virtual_ip_secondary with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Slave) (id:colocation-virtual_ip_secondary-GreenArrowDataClone-INFINITY)
  virtual_ip_primary with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-virtual_ip_primary-GreenArrowDataClone-INFINITY)
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: cluster_greenarrow
 dc-version: 1.1.12-a14efad
 have-watchdog: false
 no-quorum-policy: stop
 stonith-enabled: false
Node Attributes:
 node3: standby=on
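As a sanity check on what the policy engine intends to do with this
configuration, something like the following should show the planned
actions and allocation scores against the live cluster (crm_simulate
ships with Pacemaker):

# -L reads the live CIB, -s shows allocation scores
crm_simulate -s -L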
*Here's what was logged:*
May 28 10:19:51 node1 pengine[1296]: notice: stage6: Scheduling Node
node3 for shutdown
May 28 10:19:51 node1 pengine[1296]: notice: process_pe_message:
Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-992.bz2
May 28 10:19:51 node1 crmd[1297]: notice: run_graph: Transition 7
(Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-992.bz2): Complete
May 28 10:19:51 node1 crmd[1297]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback:
do_shutdown of node3 (op 64) is complete
May 28 10:19:51 node1 attrd[1295]: notice: crm_update_peer_state:
attrd_peer_change_cb: Node node3[3] - state is now lost (was member)
May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_remove: Removing
all node3 attributes for attrd_peer_change_cb
May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_change_cb: Lost
attribute writer node3
May 28 10:19:51 node1 corosync[1040]: [TOTEM ] Membership left list
contains incorrect address. This is sign of misconfiguration between nodes!
May 28 10:19:51 node1 corosync[1040]: [TOTEM ] A new membership
(64.21.76.61:25740) was formed. Members left: 3
May 28 10:19:51 node1 corosync[1040]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
May 28 10:19:51 node1 corosync[1040]: [QUORUM] Members[1]: 1
May 28 10:19:51 node1 corosync[1040]: [MAIN ] Completed service
synchronization, ready to provide service.
May 28 10:19:51 node1 crmd[1297]: notice: pcmk_quorum_notification:
Membership 25740: quorum lost (1)
May 28 10:19:51 node1 crmd[1297]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback:
do_shutdown of node3 (op 64) is complete
May 28 10:19:51 node1 pacemakerd[1254]: notice:
pcmk_quorum_notification: Membership 25740: quorum lost (1)
May 28 10:19:51 node1 pacemakerd[1254]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
May 28 10:19:52 node1 corosync[1040]: [TOTEM ] Automatically recovered
ring 1
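In case it's useful for debugging, my understanding is that the
transition file named in the log can be replayed offline with
crm_simulate, which accepts the compressed pe-input file directly:

# Replay the transition calculated around the time quorum was lost
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-992.bz2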
*Here's corosync.conf:*
totem {
    version: 2
    secauth: off
    cluster_name: cluster_greenarrow
    rrp_mode: passive
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node1
        ring1_addr: 10.10.10.2
        nodeid: 1
    }

    node {
        ring0_addr: node2
        ring1_addr: 10.10.10.3
        nodeid: 2
    }

    node {
        ring0_addr: node3
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 0
}

logging {
    to_syslog: yes
}
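One thing I'm unsure about: node3 has no ring1_addr even though rrp_mode
is passive, and the TOTEM "Membership left list contains incorrect
address" warning in the log might be related. Ring status can be checked
on each node with:

# Print the local node's status for each ring
corosync-cfgtool -s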
Thanks,
Matt