[Pacemaker] pacemaker with cman and drbd when primary node panics or powers off
Gianluca Cecchi
gianluca.cecchi at gmail.com
Mon Mar 3 19:09:19 CET 2014
Hello,
I'm testing pacemaker with cman on CentOS 6.5, where I have a drbd
resource in a classic primary/secondary setup with a master/slave config.
Relevant packages:
cman-3.0.12.1-59.el6_5.1.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
kmod-drbd84-8.4.4-1.el6.elrepo.x86_64
drbd84-utils-8.4.4-2.el6.elrepo.x86_64
kernel 2.6.32-431.5.1.el6.x86_64
From the cman point of view, I delegated fencing to pacemaker with the
fence_pcmk fence agent in cluster.conf.
From the pacemaker point of view:
# pcs cluster cib | grep cib-boot | egrep "quorum|stonith"
<nvpair id="cib-bootstrap-options-stonith-enabled"
name="stonith-enabled" value="false"/>
<nvpair id="cib-bootstrap-options-no-quorum-policy"
name="no-quorum-policy" value="ignore"/>
From the drbd point of view:
resource res0 {
disk {
disk-flushes no;
md-flushes no;
fencing resource-only;
}
device minor 0;
disk /dev/sdb;
syncer {
rate 30M;
verify-alg md5;
}
handlers {
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
What is the expected behavior if I force a power off of the primary
node, where the resource is master?
In my test, after I power off iclnode01, the status remains:
Last updated: Mon Mar 3 18:37:02 2014
Last change: Mon Mar 3 18:37:02 2014 via crmd on iclnode02
Stack: cman
Current DC: iclnode02 - partition WITHOUT quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured
12 Resources configured
Online: [ iclnode02 ]
OFFLINE: [ iclnode01 ]
Master/Slave Set: ms_MyData [MyData]
Slaves: [ iclnode02 ]
Stopped: [ iclnode01 ]
and
# cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by
phil at Build64R6, 2013-10-14 15:33:06
0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Outdated C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
In the messages log I see that crm-fence-peer.sh did its job and placed the constraint:
Mar 3 18:25:35 node02 kernel: drbd res0: helper command:
/sbin/drbdadm fence-peer res0
Mar 3 18:25:35 node02 crm-fence-peer.sh[7633]: invoked for res0
Mar 3 18:25:35 node02 cibadmin[7664]: notice: crm_log_args:
Invoked: cibadmin -C -o constraints -X <rsc_location rsc="ms_MyData"
id="drbd-fence-by-handler-res0-ms_MyData">#012 <rule role="Master"
score="-INFINITY" id="drbd-fence-by-handler-res0-rule-ms_MyData">#012
<expression attribute="#uname" operation="ne"
value="node02.localdomain.local"
id="drbd-fence-by-handler-res0-expr-ms_MyData"/>#012
</rule>#012</rsc_location>
Mar 3 18:25:35 node02 stonith-ng[1894]: notice: unpack_config: On
loss of CCM Quorum: Ignore
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: Diff: --- 0.127.36
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: Diff: +++
0.128.1 6e071e71b96b076e87b27c299ba3057d
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: -- <cib
admin_epoch="0" epoch="127" num_updates="36"/>
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++
<rsc_location rsc="ms_MyData"
id="drbd-fence-by-handler-res0-ms_MyData">
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++ <rule
role="Master" score="-INFINITY"
id="drbd-fence-by-handler-res0-rule-ms_MyData">
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++
<expression attribute="#uname" operation="ne"
value="node02.localdomain.local"
id="drbd-fence-by-handler-res0-expr-ms_MyData"/>
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++ </rule>
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++ </rsc_location>
Mar 3 18:25:35 node02 crm-fence-peer.sh[7633]: INFO peer is not
reachable, my disk is UpToDate: placed constraint
'drbd-fence-by-handler-res0-ms_MyData'
Mar 3 18:25:35 node02 kernel: drbd res0: helper command:
/sbin/drbdadm fence-peer res0 exit code 5 (0x500)
Mar 3 18:25:35 node02 kernel: drbd res0: fence-peer helper returned 5
(peer is unreachable, assumed to be dead)
Mar 3 18:25:35 node02 kernel: drbd res0: pdsk( DUnknown -> Outdated )
Mar 3 18:25:35 node02 kernel: block drbd0: role( Secondary -> Primary )
Mar 3 18:25:35 node02 kernel: block drbd0: new current UUID
03E9D09641694365:B5B5224185905A78:83887B50434B5AB6:83877B50434B5AB6
but soon after, it is demoted because "monitor" found it in master mode:
Mar 3 18:25:35 node02 crmd[1898]: notice: process_lrm_event: LRM
operation MyData_promote_0 (call=305, rc=0, cib-update=90,
confirmed=true) ok
Mar 3 18:25:35 node02 crmd[1898]: notice: te_rsc_command:
Initiating action 54: notify MyData_post_notify_promote_0 on iclnode02
(local)
Mar 3 18:25:36 node02 crmd[1898]: notice: process_lrm_event: LRM
operation MyData_notify_0 (call=308, rc=0, cib-update=0,
confirmed=true) ok
Mar 3 18:25:36 node02 crmd[1898]: notice: run_graph: Transition 1
(Complete=9, Pending=0, Fired=0, Skipped=2, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-995.bz2): Stopped
Mar 3 18:25:36 node02 pengine[1897]: notice: unpack_config: On loss
of CCM Quorum: Ignore
Mar 3 18:25:36 node02 pengine[1897]: notice: unpack_rsc_op:
Operation monitor found resource MyData:0 active in master mode on
iclnode02
Mar 3 18:25:36 node02 pengine[1897]: notice: LogActions: Demote
MyData:0#011(Master -> Slave iclnode02)
Mar 3 18:25:36 node02 pengine[1897]: notice: process_pe_message:
Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-996.bz2
Mar 3 18:25:36 node02 crmd[1898]: notice: te_rsc_command:
Initiating action 53: notify MyData_pre_notify_demote_0 on iclnode02
(local)
Mar 3 18:25:36 node02 crmd[1898]: notice: process_lrm_event: LRM
operation MyData_notify_0 (call=311, rc=0, cib-update=0,
confirmed=true) ok
Mar 3 18:25:36 node02 crmd[1898]: notice: te_rsc_command:
Initiating action 5: demote MyData_demote_0 on iclnode02 (local)
Mar 3 18:25:36 node02 kernel: block drbd0: role( Primary -> Secondary )
Mar 3 18:25:36 node02 kernel: block drbd0: bitmap WRITE of 0 pages
took 0 jiffies
Mar 3 18:25:36 node02 kernel: block drbd0: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Mar 3 18:25:36 node02 crmd[1898]: notice: process_lrm_event: LRM
operation MyData_demote_0 (call=314, rc=0, cib-update=92,
confirmed=true) ok
Mar 3 18:25:36 node02 crmd[1898]: notice: te_rsc_command:
Initiating action 54: notify MyData_post_notify_demote_0 on iclnode02
(local)
Mar 3 18:25:36 node02 crmd[1898]: notice: process_lrm_event: LRM
operation MyData_notify_0 (call=317, rc=0, cib-update=0,
confirmed=true) ok
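To check whether the constraint placed by crm-fence-peer.sh is still in the CIB, something like the following might help. It is only a sketch: the function name list_drbd_fence_constraints is made up for illustration, and it assumes the constraint ids begin with "drbd-fence-by-handler-" as they do in the log above.

```shell
#!/bin/sh
# Hypothetical helper (not part of drbd or pacemaker): read CIB constraint XML
# on stdin and print the id of every rsc_location whose id starts with
# "drbd-fence-by-handler-", the prefix seen in the log above.
list_drbd_fence_constraints() {
    # Join the input into one line so that tags wrapped across lines still match.
    tr '\n' ' ' |
        # Keep only the opening rsc_location tags (skips nested rule/expression ids).
        grep -o '<rsc_location[^>]*>' |
        # Pull out the id attribute of the fence constraints.
        grep -o 'id="drbd-fence-by-handler-[^"]*"' |
        # Strip the attribute name and the quotes, leaving the bare id.
        sed -e 's/^id="//' -e 's/"$//'
}
```

On a cluster node it could be fed from the live CIB with "cibadmin -Q -o constraints | list_drbd_fence_constraints"; no output would mean no such constraint is present.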
Suppose I know that iclnode01 has a permanent problem and I can't
recover it for some time: what is the correct manual action, from the
pacemaker point of view, to force iclnode02 to carry on the service
(I have a group configured in the standard way, with colocation + order:
# pcs constraint colocation add Started my_group with Master ms_MyData INFINITY
# pcs constraint order promote ms_MyData then start my_group
)
and is there any automated action to manage this kind of problem in
2-node clusters?
Can I solve this problem if I configure a stonith agent?
Thanks in advance,
Gianluca