[Pacemaker] Trying to figure out a constraint
Digimer
lists at alteeve.ca
Thu Jun 19 03:42:39 UTC 2014
On 18/06/14 12:47 AM, Andrew Beekhof wrote:
>
> On 18 Jun 2014, at 2:03 pm, Digimer <lists at alteeve.ca> wrote:
>
>> Hi all,
>>
>> I am trying to set up a basic pacemaker 1.1.10 on RHEL 6.5 with DRBD 8.3.16.
>>
>> I've set up DRBD and configured one clustered LVM volume group using that drbd resource as the PV. With DRBD configured alone, I can stop/start pacemaker repeatedly without issue. However, when I add the LVM VG using ocf:heartbeat:LVM and set up a constraint, subsequent restarts of pacemaker almost always end in a fence. I have to think, then, that I am messing up my constraints...
>
> find out who is calling stonith_admin:
>
> Jun 17 23:56:06 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
> Jun 17 23:56:07 an-a04n01 kernel: block drbd0: Handshake successful: Agreed network protocol version 97
> Jun 17 23:56:07 an-a04n01 stonith_admin[28637]: notice: crm_log_args: Invoked: stonith_admin --fence an-a04n02.alteeve.ca
> Jun 17 23:56:07 an-a04n01 stonith-ng[28356]: notice: handle_request: Client stonith_admin.28637.6ed13ba6 wants to fence (off) 'an-a04n02.alteeve.ca' with device '(any)'
>
> Double check fence_pcmk includes "--tag cman" as an argument to stonith_admin (since that will rule it out as a source).
> Could drbd be initiating it?
Following up on the #linux-ha discussion...
DRBD was triggering this. When I used
/usr/lib/drbd/stonith_admin-fence-peer.sh, I saw both nodes sit at
'WFConnection' before the fence fired, which normally tells me there is a
network issue. However, starting DRBD manually never had a problem, and
after the fenced node came back up, starting pacemaker brought it up
fine.
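For anyone following along, this is roughly how I was checking the connection state on both nodes while it sat there (just the standard DRBD status commands, nothing exotic):

```shell
# Show the replication link state; 'WFConnection' on both nodes means
# each peer is waiting for the other, i.e. the link never came up.
cat /proc/drbd

# The same connection state for a single resource (r0 here):
drbdadm cstate r0
```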
So I decided to try '/usr/lib/drbd/crm-fence-peer.sh' instead. Now,
instead of fencing, an-a04n02 (node 2) fails to promote. However, if I
try running:
pcs resource debug-start drbd_r0
I get:
Operation start for drbd_r0:0 (ocf:linbit:drbd) returned 0
> stdout: allow-two-primaries;
> stdout:
> stdout:
> stderr: WARNING: You may be disappointed: This RA is intended for pacemaker 1.0 or better!
> stderr: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf adjust r0
> stderr: DEBUG: r0: Exit code 0
> stderr: DEBUG: r0: Command output:
> stderr: DEBUG: r0: Calling /usr/sbin/crm_master -Q -l reboot -v 10000
> stderr: DEBUG: r0: Exit code 0
> stderr: DEBUG: r0: Command output:
This suggests the resource agent itself believes the start will work.
However, the cluster is left at:
Cluster name: an-anvil-04
Last updated: Wed Jun 18 23:37:43 2014
Last change: Wed Jun 18 23:19:24 2014 via cibadmin on an-a04n02.alteeve.ca
Stack: cman
Current DC: an-a04n01.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
4 Resources configured
Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]
Full list of resources:
fence_n01_ipmi (stonith:fence_ipmilan): Started an-a04n01.alteeve.ca
fence_n02_ipmi (stonith:fence_ipmilan): Started an-a04n02.alteeve.ca
Master/Slave Set: drbd_r0_Clone [drbd_r0]
Masters: [ an-a04n01.alteeve.ca ]
Slaves: [ an-a04n02.alteeve.ca ]
Then, when I check the constraints, I see that this has been created:
Location Constraints:
Resource: drbd_r0_Clone
Constraint: drbd-fence-by-handler-r0-drbd_r0_Clone
Rule: score=-INFINITY role=Master (id:drbd-fence-by-handler-r0-rule-drbd_r0_Clone)
Expression: #uname ne an-a04n01.alteeve.ca (id:drbd-fence-by-handler-r0-expr-drbd_r0_Clone)
Ordering Constraints:
Colocation Constraints:
If I delete the constraint, node 2 suddenly promotes properly.
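(For completeness, deleting it was just a matter of removing the location constraint by the id shown in the listing above.)

```shell
# Manually clear the fence constraint that crm-fence-peer.sh left behind,
# using the constraint id from the 'pcs config' output above.
pcs constraint remove drbd-fence-by-handler-r0-drbd_r0_Clone
```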
So I have to conclude that, for some reason, the way I am using
ocf:linbit:drbd is wrong, or I've not configured
'/etc/drbd.d/global_common.conf' properly.
Speaking of which, that is:
# /etc/drbd.conf
common {
    protocol C;
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        fencing resource-and-stonith;
    }
    syncer {
        rate 40M;
    }
    handlers {
        fence-peer /usr/lib/drbd/crm-fence-peer.sh;
    }
}
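One thing I'm double-checking on my end: crm-fence-peer.sh is normally paired with a companion unfence handler that removes the constraint again once resync completes. If I understand the stock 8.3 scripts right, the handlers block would look something like this:

```
handlers {
    fence-peer /usr/lib/drbd/crm-fence-peer.sh;
    after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
}
```

Without the after-resync-target handler, the -INFINITY constraint would simply be left in place, which looks a lot like what I'm seeing.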
# resource r0 on an-a04n01.alteeve.ca: not ignored, not stacked
resource r0 {
    on an-a04n01.alteeve.ca {
        device /dev/drbd0 minor 0;
        disk /dev/sda5;
        address ipv4 10.10.40.1:7788;
        meta-disk internal;
    }
    on an-a04n02.alteeve.ca {
        device /dev/drbd0 minor 0;
        disk /dev/sda5;
        address ipv4 10.10.40.2:7788;
        meta-disk internal;
    }
}
# resource r1 on an-a04n01.alteeve.ca: not ignored, not stacked
resource r1 {
    on an-a04n01.alteeve.ca {
        device /dev/drbd1 minor 1;
        disk /dev/sda6;
        address ipv4 10.10.40.1:7789;
        meta-disk internal;
    }
    on an-a04n02.alteeve.ca {
        device /dev/drbd1 minor 1;
        disk /dev/sda6;
        address ipv4 10.10.40.2:7789;
        meta-disk internal;
    }
}
Note that, for the time being, I've not configured r1 in pacemaker, to
keep the config simpler while debugging.
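Once r0 behaves, my plan for r1 is just to mirror the r0 resource. A hypothetical sketch of the commands, assuming the same names and options I used for drbd_r0:

```shell
# Add the second DRBD resource as a master/slave clone, mirroring r0.
# Resource names and option values here are assumptions based on my
# drbd_r0 setup, not something I've run yet.
pcs resource create drbd_r1 ocf:linbit:drbd drbd_resource=r1 \
    op monitor interval=30s
pcs resource master drbd_r1_Clone drbd_r1 master-max=2 master-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true
```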
Attached is the crm_report, hopefully it might shed some light on what I
am doing wrong.
Thanks!
digimer
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: an-anvil-04_2014-06-18_01.tar.bz2
Type: application/x-bzip
Size: 180922 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20140618/70265908/attachment-0004.bin>