[Pacemaker] pacemaker with cman and drbd when primary node panics or powers off

Digimer lists at alteeve.ca
Mon Mar 3 15:29:30 EST 2014


Two possible problems:

1. cman's cluster.conf needs the '<cman two_node="1" expected_votes="1" />' directive (rough sketch below).

2. You don't have fencing set up. The 'fence_pcmk' script only works if 
pacemaker's stonith is enabled and configured properly. Likewise, you 
will need to configure DRBD to use the 'crm-fence-peer.sh' handler and 
set the 'fencing resource-and-stonith;' policy (again, see the sketches 
below).
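
As a rough sketch of all three pieces (the node names are taken from your
output; the cluster name, the fence_ipmilan agent, the IPMI addresses and
the credentials are only placeholders for whatever matches your hardware):

cluster.conf, with two_node/expected_votes set and fencing redirected to
pacemaker via fence_pcmk:

  <?xml version="1.0"?>
  <cluster name="iclcluster" config_version="2">
    <!-- two_node lets a 2-node cluster keep quorum with a single vote -->
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="iclnode01" nodeid="1">
        <fence>
          <method name="pcmk-redirect">
            <device name="pcmk" port="iclnode01"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="iclnode02" nodeid="2">
        <fence>
          <method name="pcmk-redirect">
            <device name="pcmk" port="iclnode02"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <!-- fence_pcmk hands the actual fencing off to pacemaker's stonith -->
      <fencedevice name="pcmk" agent="fence_pcmk"/>
    </fencedevices>
  </cluster>

Real stonith devices on the pacemaker side, and stonith enabled:

  # pcs stonith create fence_icl01 fence_ipmilan \
        pcmk_host_list="iclnode01" ipaddr="192.168.0.1" \
        login="admin" passwd="secret" action="reboot"
  # pcs stonith create fence_icl02 fence_ipmilan \
        pcmk_host_list="iclnode02" ipaddr="192.168.0.2" \
        login="admin" passwd="secret" action="reboot"
  # pcs property set stonith-enabled=true

And the DRBD resource switched from 'resource-only' to
'resource-and-stonith', keeping the handlers you already have:

  resource res0 {
    disk {
      # freeze I/O while disconnected, until the fence-peer handler
      # confirms the peer has actually been fenced
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }

The point of all this is that the surviving node can then confirm its
peer is actually dead rather than merely unreachable, which is what lets
it safely stay Primary.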

digimer

On 03/03/14 01:09 PM, Gianluca Cecchi wrote:
> Hello,
> I'm testing pacemaker with cman on CentOS 6.5, where I have a drbd
> resource in a classic primary/secondary setup with a master/slave config.
>
> Relevant packages:
> cman-3.0.12.1-59.el6_5.1.x86_64
> pacemaker-1.1.10-14.el6_5.2.x86_64
> kmod-drbd84-8.4.4-1.el6.elrepo.x86_64
> drbd84-utils-8.4.4-2.el6.elrepo.x86_64
> kernel 2.6.32-431.5.1.el6.x86_64
>
> From cman's point of view, I delegated fencing to pacemaker with the
> fence_pcmk fence agent in cluster.conf.
>
> From pacemaker's point of view:
> # pcs cluster cib | grep cib-boot | egrep "quorum|stonith"
>          <nvpair id="cib-bootstrap-options-stonith-enabled"
> name="stonith-enabled" value="false"/>
>          <nvpair id="cib-bootstrap-options-no-quorum-policy"
> name="no-quorum-policy" value="ignore"/>
>
> From drbd's point of view:
> resource res0 {
>   disk {
>     disk-flushes no;
>     md-flushes no;
>     fencing resource-only;
>   }
>   device minor 0;
>   disk /dev/sdb;
>   syncer {
>     rate 30M;
>     verify-alg md5;
>   }
>   handlers {
>     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>   }
> }
>
> What is the expected behavior if I force a power off of the primary
> node where the resource is master?
>
> In my test, when I power off iclnode01, the status remains:
> Last updated: Mon Mar  3 18:37:02 2014
> Last change: Mon Mar  3 18:37:02 2014 via crmd on iclnode02
> Stack: cman
> Current DC: iclnode02 - partition WITHOUT quorum
> Version: 1.1.10-14.el6_5.2-368c726
> 2 Nodes configured
> 12 Resources configured
>
>
> Online: [ iclnode02 ]
> OFFLINE: [ iclnode01 ]
>
>   Master/Slave Set: ms_MyData [MyData]
>       Slaves: [ iclnode02 ]
>       Stopped: [ iclnode01 ]
>
>
> and
> # cat /proc/drbd
> version: 8.4.4 (api:1/proto:86-101)
> GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by
> phil at Build64R6, 2013-10-14 15:33:06
>   0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Outdated C r-----
>      ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
>
> In the messages log I can see that crm-fence-peer.sh did its job and put the constraint in place:
> Mar  3 18:25:35 node02 kernel: drbd res0: helper command:
> /sbin/drbdadm fence-peer res0
> Mar  3 18:25:35 node02 crm-fence-peer.sh[7633]: invoked for res0
> Mar  3 18:25:35 node02 cibadmin[7664]:   notice: crm_log_args:
> Invoked: cibadmin -C -o constraints -X <rsc_location rsc="ms_MyData"
> id="drbd-fence-by-handler-res0-ms_MyData">#012  <rule role="Master"
> score="-INFINITY" id="drbd-fence-by-handler-res0-rule-ms_MyData">#012
>    <expression attribute="#uname" operation="ne"
> value="node02.localdomain.local"
> id="drbd-fence-by-handler-res0-expr-ms_MyData"/>#012
> </rule>#012</rsc_location>
> Mar  3 18:25:35 node02 stonith-ng[1894]:   notice: unpack_config: On
> loss of CCM Quorum: Ignore
> Mar  3 18:25:35 node02 cib[1893]:   notice: cib:diff: Diff: --- 0.127.36
> Mar  3 18:25:35 node02 cib[1893]:   notice: cib:diff: Diff: +++
> 0.128.1 6e071e71b96b076e87b27c299ba3057d
> Mar  3 18:25:35 node02 cib[1893]:   notice: cib:diff: -- <cib
> admin_epoch="0" epoch="127" num_updates="36"/>
> Mar  3 18:25:35 node02 cib[1893]:   notice: cib:diff: ++
> <rsc_location rsc="ms_MyData"
> id="drbd-fence-by-handler-res0-ms_MyData">
> Mar  3 18:25:35 node02 cib[1893]:   notice: cib:diff: ++         <rule
> role="Master" score="-INFINITY"
> id="drbd-fence-by-handler-res0-rule-ms_MyData">
> Mar  3 18:25:35 node02 cib[1893]:   notice: cib:diff: ++
> <expression attribute="#uname" operation="ne"
> value="node02.localdomain.local"
> id="drbd-fence-by-handler-res0-expr-ms_MyData"/>
> Mar  3 18:25:35 node02 cib[1893]:   notice: cib:diff: ++         </rule>
> Mar  3 18:25:35 node02 cib[1893]:   notice: cib:diff: ++       </rsc_location>
> Mar  3 18:25:35 node02 crm-fence-peer.sh[7633]: INFO peer is not
> reachable, my disk is UpToDate: placed constraint
> 'drbd-fence-by-handler-res0-ms_MyData'
> Mar  3 18:25:35 node02 kernel: drbd res0: helper command:
> /sbin/drbdadm fence-peer res0 exit code 5 (0x500)
> Mar  3 18:25:35 node02 kernel: drbd res0: fence-peer helper returned 5
> (peer is unreachable, assumed to be dead)
> Mar  3 18:25:35 node02 kernel: drbd res0: pdsk( DUnknown -> Outdated )
> Mar  3 18:25:35 node02 kernel: block drbd0: role( Secondary -> Primary )
> Mar  3 18:25:35 node02 kernel: block drbd0: new current UUID
> 03E9D09641694365:B5B5224185905A78:83887B50434B5AB6:83877B50434B5AB6
>
> but soon after it is demoted, since the "monitor" operation found it in master mode...
> Mar  3 18:25:35 node02 crmd[1898]:   notice: process_lrm_event: LRM
> operation MyData_promote_0 (call=305, rc=0, cib-update=90,
> confirmed=true) ok
> Mar  3 18:25:35 node02 crmd[1898]:   notice: te_rsc_command:
> Initiating action 54: notify MyData_post_notify_promote_0 on iclnode02
> (local)
> Mar  3 18:25:36 node02 crmd[1898]:   notice: process_lrm_event: LRM
> operation MyData_notify_0 (call=308, rc=0, cib-update=0,
> confirmed=true) ok
> Mar  3 18:25:36 node02 crmd[1898]:   notice: run_graph: Transition 1
> (Complete=9, Pending=0, Fired=0, Skipped=2, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-995.bz2): Stopped
> Mar  3 18:25:36 node02 pengine[1897]:   notice: unpack_config: On loss
> of CCM Quorum: Ignore
> Mar  3 18:25:36 node02 pengine[1897]:   notice: unpack_rsc_op:
> Operation monitor found resource MyData:0 active in master mode on
> iclnode02
> Mar  3 18:25:36 node02 pengine[1897]:   notice: LogActions: Demote
> MyData:0#011(Master -> Slave iclnode02)
> Mar  3 18:25:36 node02 pengine[1897]:   notice: process_pe_message:
> Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-996.bz2
> Mar  3 18:25:36 node02 crmd[1898]:   notice: te_rsc_command:
> Initiating action 53: notify MyData_pre_notify_demote_0 on iclnode02
> (local)
> Mar  3 18:25:36 node02 crmd[1898]:   notice: process_lrm_event: LRM
> operation MyData_notify_0 (call=311, rc=0, cib-update=0,
> confirmed=true) ok
> Mar  3 18:25:36 node02 crmd[1898]:   notice: te_rsc_command:
> Initiating action 5: demote MyData_demote_0 on iclnode02 (local)
> Mar  3 18:25:36 node02 kernel: block drbd0: role( Primary -> Secondary )
> Mar  3 18:25:36 node02 kernel: block drbd0: bitmap WRITE of 0 pages
> took 0 jiffies
> Mar  3 18:25:36 node02 kernel: block drbd0: 0 KB (0 bits) marked
> out-of-sync by on disk bit-map.
> Mar  3 18:25:36 node02 crmd[1898]:   notice: process_lrm_event: LRM
> operation MyData_demote_0 (call=314, rc=0, cib-update=92,
> confirmed=true) ok
> Mar  3 18:25:36 node02 crmd[1898]:   notice: te_rsc_command:
> Initiating action 54: notify MyData_post_notify_demote_0 on iclnode02
> (local)
> Mar  3 18:25:36 node02 crmd[1898]:   notice: process_lrm_event: LRM
> operation MyData_notify_0 (call=317, rc=0, cib-update=0,
> confirmed=true) ok
>
>
> Suppose I know that iclnode01 has a permanent problem and I can't
> recover it for some time: what is the correct manual action, from
> pacemaker's point of view, to force iclnode02 to carry on the service?
> (I have a group configured in the standard way with colocation + order:
> # pcs constraint colocation add Started my_group with Master ms_MyData INFINITY
> # pcs constraint order promote ms_MyData then start my_group
> )
>
> And is there any automated action to manage this kind of problem in
> 2-node clusters?
> Can I solve this problem if I configure a stonith agent?
>
> Thanks in advance,
> Gianluca
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?



