[Pacemaker] Trying to figure out a constraint
Digimer
lists at alteeve.ca
Thu Jun 19 06:06:44 CEST 2014
On 18/06/14 11:42 PM, Digimer wrote:
> On 18/06/14 12:47 AM, Andrew Beekhof wrote:
>>
>> On 18 Jun 2014, at 2:03 pm, Digimer <lists at alteeve.ca> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to set up a basic pacemaker 1.1.10 cluster on RHEL 6.5 with
>>> DRBD 8.3.16.
>>>
>>> I've set up DRBD and configured one clustered LVM volume group using
>>> that drbd resource as the PV. With DRBD configured alone, I can
>>> stop/start pacemaker repeatedly without issue. However, when I add
>>> the LVM VG using ocf:heartbeat:LVM and set up a constraint, subsequent
>>> restarts of pacemaker almost always end up with a fence. I have to
>>> think, then, that I am messing up my constraints...
>>
>> find out who is calling stonith_admin:
>>
>> Jun 17 23:56:06 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
>> Jun 17 23:56:07 an-a04n01 kernel: block drbd0: Handshake successful: Agreed network protocol version 97
>> Jun 17 23:56:07 an-a04n01 stonith_admin[28637]: notice: crm_log_args: Invoked: stonith_admin --fence an-a04n02.alteeve.ca
>> Jun 17 23:56:07 an-a04n01 stonith-ng[28356]: notice: handle_request: Client stonith_admin.28637.6ed13ba6 wants to fence (off) 'an-a04n02.alteeve.ca' with device '(any)'
>>
>> Double check fence_pcmk includes "--tag cman" as an argument to
>> stonith_admin (since that will rule it out as a source).
>> Could drbd be initiating it?
>
> Following up on #linux-ha discussion...
>
> DRBD was triggering this. When I used
> /usr/lib/drbd/stonith_admin-fence-peer.sh, I saw both nodes sit at
> 'WFConnection' before it fenced, which normally tells me there is a
> network issue. However, manually starting DRBD never had a problem, and
> after the fenced node comes back up, starting pacemaker causes it to
> start fine.
>
> So I decided to try '/usr/lib/drbd/crm-fence-peer.sh' instead. Now,
> instead of fencing, an-a04n02 (node 2) fails to promote. However, if I
> try running:
>
> pcs resource debug-start drbd_r0
>
> I get:
>
> Operation start for drbd_r0:0 (ocf:linbit:drbd) returned 0
> > stdout: allow-two-primaries;
> > stdout:
> > stdout:
> > stderr: WARNING: You may be disappointed: This RA is intended for pacemaker 1.0 or better!
> > stderr: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf adjust r0
> > stderr: DEBUG: r0: Exit code 0
> > stderr: DEBUG: r0: Command output:
> > stderr: DEBUG: r0: Calling /usr/sbin/crm_master -Q -l reboot -v 10000
> > stderr: DEBUG: r0: Exit code 0
> > stderr: DEBUG: r0: Command output:
>
> This seems to indicate that the resource agent thinks the start
> succeeded. However, the cluster is left at:
>
> Cluster name: an-anvil-04
> Last updated: Wed Jun 18 23:37:43 2014
> Last change: Wed Jun 18 23:19:24 2014 via cibadmin on an-a04n02.alteeve.ca
> Stack: cman
> Current DC: an-a04n01.alteeve.ca - partition with quorum
> Version: 1.1.10-14.el6_5.3-368c726
> 2 Nodes configured
> 4 Resources configured
>
>
> Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]
>
> Full list of resources:
>
> fence_n01_ipmi (stonith:fence_ipmilan): Started an-a04n01.alteeve.ca
> fence_n02_ipmi (stonith:fence_ipmilan): Started an-a04n02.alteeve.ca
> Master/Slave Set: drbd_r0_Clone [drbd_r0]
>     Masters: [ an-a04n01.alteeve.ca ]
>     Slaves: [ an-a04n02.alteeve.ca ]
>
> Then, when I check the constraints, I see that this has been created:
>
> Location Constraints:
>   Resource: drbd_r0_Clone
>     Constraint: drbd-fence-by-handler-r0-drbd_r0_Clone
>       Rule: score=-INFINITY role=Master (id:drbd-fence-by-handler-r0-rule-drbd_r0_Clone)
>         Expression: #uname ne an-a04n01.alteeve.ca (id:drbd-fence-by-handler-r0-expr-drbd_r0_Clone)
> Ordering Constraints:
> Colocation Constraints:
>
> If I delete the constraint, node 2 suddenly promotes properly.
>
> So I have to conclude that, for some reason, the way I am using
> ocf:linbit:drbd is wrong, or I've not configured
> '/etc/drbd.d/global_common.conf' properly.
>
> Speaking of which, that is:
>
> # /etc/drbd.conf
> common {
>     protocol C;
>     net {
>         allow-two-primaries;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>     }
>     disk {
>         fencing resource-and-stonith;
>     }
>     syncer {
>         rate 40M;
>     }
>     handlers {
>         fence-peer /usr/lib/drbd/crm-fence-peer.sh;
>     }
> }
>
> # resource r0 on an-a04n01.alteeve.ca: not ignored, not stacked
> resource r0 {
>     on an-a04n01.alteeve.ca {
>         device /dev/drbd0 minor 0;
>         disk /dev/sda5;
>         address ipv4 10.10.40.1:7788;
>         meta-disk internal;
>     }
>     on an-a04n02.alteeve.ca {
>         device /dev/drbd0 minor 0;
>         disk /dev/sda5;
>         address ipv4 10.10.40.2:7788;
>         meta-disk internal;
>     }
> }
>
> # resource r1 on an-a04n01.alteeve.ca: not ignored, not stacked
> resource r1 {
>     on an-a04n01.alteeve.ca {
>         device /dev/drbd1 minor 1;
>         disk /dev/sda6;
>         address ipv4 10.10.40.1:7789;
>         meta-disk internal;
>     }
>     on an-a04n02.alteeve.ca {
>         device /dev/drbd1 minor 1;
>         disk /dev/sda6;
>         address ipv4 10.10.40.2:7789;
>         meta-disk internal;
>     }
> }
>
> Note that, for the time being, I've not configured r1 in pacemaker to
> simplify the config while debugging.
>
> Attached is the crm_report; hopefully it sheds some light on what I
> am doing wrong.
>
> Thanks!
>
> digimer
After sending this, I found that adding:

    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

allowed the constraint to be removed, so node 2 (an-a04n02) eventually
promoted, but not before going into the failed state shown above.

Subsequent stop -> start cycles of pacemaker on both nodes came up cleanly,
with no fence action reported in /var/log/messages. I noticed this time that
the drbd module was loaded; not sure if that made a difference.
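
For anyone wanting to check whether the fence handler's constraint is still in place after a resync, here is a rough Python sketch. It assumes the constraint listing has been captured as text, and it only looks for the "drbd-fence-by-handler-" ID prefix seen in the listing quoted above. The function name and the sample string are made up for illustration; this is not part of pcs or DRBD.

```python
import re

# Sample text based on the constraint listing quoted earlier in this message.
SAMPLE = """\
Location Constraints:
  Resource: drbd_r0_Clone
    Constraint: drbd-fence-by-handler-r0-drbd_r0_Clone
      Rule: score=-INFINITY role=Master (id:drbd-fence-by-handler-r0-rule-drbd_r0_Clone)
        Expression: #uname ne an-a04n01.alteeve.ca (id:drbd-fence-by-handler-r0-expr-drbd_r0_Clone)
Ordering Constraints:
Colocation Constraints:
"""

# Assumption: constraints created by the fence handler carry this ID prefix,
# as observed in the listing above.
FENCE_PREFIX = "drbd-fence-by-handler-"


def leftover_fence_constraints(constraint_text):
    """Return the IDs of constraints whose ID starts with FENCE_PREFIX."""
    pattern = re.compile(
        r"^\s*Constraint:\s*(" + re.escape(FENCE_PREFIX) + r"\S+)",
        re.MULTILINE,
    )
    return pattern.findall(constraint_text)


if __name__ == "__main__":
    for constraint_id in leftover_fence_constraints(SAMPLE):
        print("leftover fence constraint:", constraint_id)
```

An empty result after the resync finishes would suggest the after-resync-target handler did its job; a non-empty one matches the "fails to promote" state described above.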
Will keep testing... Any insight is much appreciated.
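
While testing, a small Python sketch like the following could help confirm whether a restart triggered a fence request. It assumes syslog lines with the same wording as the excerpt Andrew quoted above (the DRBD "helper command: ... fence-peer" line and the stonith_admin "Invoked:" line). The marker labels and the function name are hypothetical, and other versions may log differently.

```python
import re

# Sample lines taken from the log excerpt quoted earlier in this thread.
SAMPLE_LOG = """\
Jun 17 23:56:06 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
Jun 17 23:56:07 an-a04n01 kernel: block drbd0: Handshake successful: Agreed network protocol version 97
Jun 17 23:56:07 an-a04n01 stonith_admin[28637]: notice: crm_log_args: Invoked: stonith_admin --fence an-a04n02.alteeve.ca
"""

# Assumption: these two patterns are enough to spot a DRBD-initiated fence;
# they are derived only from the lines shown above.
MARKERS = {
    "drbd fence-peer handler": re.compile(r"helper command: \S*drbdadm fence-peer"),
    "stonith_admin invoked": re.compile(r"stonith_admin\[\d+\]:.*Invoked: stonith_admin"),
}


def fence_events(log_text):
    """Return (label, line) pairs for every log line matching a marker."""
    events = []
    for line in log_text.splitlines():
        for label, pattern in MARKERS.items():
            if pattern.search(line):
                events.append((label, line))
    return events


if __name__ == "__main__":
    for label, line in fence_events(SAMPLE_LOG):
        print(label + ": " + line)
```

Seeing the fence-peer helper line just before the stonith_admin invocation, as in the sample, is what pointed at DRBD as the caller here.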
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?