[Pacemaker] Trying to figure out a constraint

Thu Jun 19 00:06:44 EDT 2014

On 18/06/14 11:42 PM, Digimer wrote:
> On 18/06/14 12:47 AM, Andrew Beekhof wrote:
>>
>> On 18 Jun 2014, at 2:03 pm, Digimer <lists at alteeve.ca> wrote:
>>
>>> Hi all,
>>>
>>>   I am trying to set up a basic pacemaker 1.1.10 cluster on RHEL 6.5
>>> with DRBD 8.3.16.
>>>
>>>   I've set up DRBD and configured one clustered LVM volume group using
>>> that drbd resource as the PV. With DRBD configured alone, I can
>>> stop/start pacemaker repeatedly without issue. However, when I add
>>> the LVM VG using ocf:heartbeat:LVM and set up a constraint, subsequent
>>> restarts of pacemaker almost always end in a fence. So I suspect
>>> that I am messing up my constraints...
>>
>> find out who is calling stonith_admin:
>>
>> Jun 17 23:56:06 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
>> Jun 17 23:56:07 an-a04n01 kernel: block drbd0: Handshake successful: Agreed network protocol version 97
>> Jun 17 23:56:07 an-a04n01 stonith_admin[28637]:   notice: crm_log_args: Invoked: stonith_admin --fence an-a04n02.alteeve.ca
>> Jun 17 23:56:07 an-a04n01 stonith-ng[28356]:   notice: handle_request: Client stonith_admin.28637.6ed13ba6 wants to fence (off) 'an-a04n02.alteeve.ca' with device '(any)'
>>
>> Double check fence_pcmk includes "--tag cman" as an argument to
>> stonith_admin (since that will rule it out as a source).
>> Could drbd be initiating it?
>
> Following up on #linux-ha discussion...
>
> DRBD was triggering this. When I used
> /usr/lib/drbd/stonith_admin-fence-peer.sh, I saw both nodes sit at
> 'WFConnection' before it fenced, which normally tells me there is a
> network issue. However, manually starting DRBD never had a problem, and
> after the fenced node comes back up, starting pacemaker causes it to
> start fine.
>
> So I decided to try '/usr/lib/drbd/crm-fence-peer.sh' instead. Now,
> instead of fencing, an-a04n02 (node 2) fails to promote. However, if I
> try running:
>
> pcs resource debug-start drbd_r0
>
> I get:
>
> Operation start for drbd_r0:0 (ocf:linbit:drbd) returned 0
>   >  stdout:         allow-two-primaries;
>   >  stdout:
>   >  stdout:
>   >  stderr: WARNING: You may be disappointed: This RA is intended for pacemaker 1.0 or better!
>   >  stderr: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf adjust r0
>   >  stderr: DEBUG: r0: Exit code 0
>   >  stderr: DEBUG: r0: Command output:
>   >  stderr: DEBUG: r0: Calling /usr/sbin/crm_master -Q -l reboot -v 10000
>   >  stderr: DEBUG: r0: Exit code 0
>   >  stderr: DEBUG: r0: Command output:
>
> This suggests that the start operation itself thinks it succeeded.
> However, the cluster is left at:
>
> Cluster name: an-anvil-04
> Last updated: Wed Jun 18 23:37:43 2014
> Last change: Wed Jun 18 23:19:24 2014 via cibadmin on an-a04n02.alteeve.ca
> Stack: cman
> Current DC: an-a04n01.alteeve.ca - partition with quorum
> Version: 1.1.10-14.el6_5.3-368c726
> 2 Nodes configured
> 4 Resources configured
>
>
> Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]
>
> Full list of resources:
>
>   fence_n01_ipmi    (stonith:fence_ipmilan):    Started an-a04n01.alteeve.ca
>   fence_n02_ipmi    (stonith:fence_ipmilan):    Started an-a04n02.alteeve.ca
>   Master/Slave Set: drbd_r0_Clone [drbd_r0]
>       Masters: [ an-a04n01.alteeve.ca ]
>       Slaves: [ an-a04n02.alteeve.ca ]
>
> When I then check the constraints, I see that this has been created:
>
> Location Constraints:
>    Resource: drbd_r0_Clone
>      Constraint: drbd-fence-by-handler-r0-drbd_r0_Clone
>        Rule: score=-INFINITY role=Master (id:drbd-fence-by-handler-r0-rule-drbd_r0_Clone)
>          Expression: #uname ne an-a04n01.alteeve.ca (id:drbd-fence-by-handler-r0-expr-drbd_r0_Clone)
> Ordering Constraints:
> Colocation Constraints:
>
> If I delete the constraint, node 2 suddenly promotes properly.
>
> So I have to conclude that, for some reason, the way I am using
> ocf:linbit:drbd is wrong, or I've not configured
> '/etc/drbd.d/global_common.conf' properly.
>
> Speaking of which, that is:
>
> # /etc/drbd.conf
> common {
>      protocol               C;
>      net {
>          allow-two-primaries;
>          after-sb-0pri    discard-zero-changes;
>          after-sb-1pri    discard-secondary;
>          after-sb-2pri    disconnect;
>      }
>      disk {
>          fencing          resource-and-stonith;
>      }
>      syncer {
>          rate             40M;
>      }
>      handlers {
>          fence-peer       /usr/lib/drbd/crm-fence-peer.sh;
>      }
> }
>
> # resource r0 on an-a04n01.alteeve.ca: not ignored, not stacked
> resource r0 {
>      on an-a04n01.alteeve.ca {
>          device           /dev/drbd0 minor 0;
>          disk             /dev/sda5;
>          address          ipv4 10.10.40.1:7788;
>          meta-disk        internal;
>      }
>      on an-a04n02.alteeve.ca {
>          device           /dev/drbd0 minor 0;
>          disk             /dev/sda5;
>          address          ipv4 10.10.40.2:7788;
>          meta-disk        internal;
>      }
> }
>
> # resource r1 on an-a04n01.alteeve.ca: not ignored, not stacked
> resource r1 {
>      on an-a04n01.alteeve.ca {
>          device           /dev/drbd1 minor 1;
>          disk             /dev/sda6;
>          address          ipv4 10.10.40.1:7789;
>          meta-disk        internal;
>      }
>      on an-a04n02.alteeve.ca {
>          device           /dev/drbd1 minor 1;
>          disk             /dev/sda6;
>          address          ipv4 10.10.40.2:7789;
>          meta-disk        internal;
>      }
> }
>
> Note that, for the time being, I've not configured r1 in pacemaker to
> simplify the config while debugging.
>
> Attached is the crm_report, hopefully it might shed some light on what I
> am doing wrong.
>
> Thanks!
>
> digimer

After sending this, I found that adding:

handlers {
	fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
	after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

allowed the constraint to be removed, so node 2 (an-a04n02) eventually
promoted, but not before going into the failed state shown above.
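
For anyone wanting to check whether a handler-created constraint is still 
in place, here is a minimal sketch in shell. It assumes the constraint 
listing has the same layout as the one quoted above, and that constraints 
created by crm-fence-peer.sh have IDs starting with 
'drbd-fence-by-handler-' (as in that listing). The function name 
list_drbd_fence_constraints is made up for illustration.

```shell
#!/bin/sh
# Print the IDs of constraints that look like they were created by the
# DRBD fence-peer handler. Reads a constraint listing on stdin.
# Assumption: each constraint appears on a line containing
# "Constraint: <id>", as in the listing quoted earlier in this mail.
list_drbd_fence_constraints() {
    # Keep only the ID from matching "Constraint:" lines. The Rule and
    # Expression lines carry related IDs but do not contain
    # "Constraint: ", so they are skipped.
    sed -n 's/^.*Constraint: \(drbd-fence-by-handler-[^ ]*\).*$/\1/p'
}
```

Fed the listing quoted above, this prints 
drbd-fence-by-handler-r0-drbd_r0_Clone. An empty result would suggest the 
unfence handler has already cleaned up.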

Subsequent stop -> start of pacemaker on both nodes started cleanly, with 
no fence action reported in /var/log/messages. I noticed this time that the 
drbd module was already loaded; I am not sure whether that made a difference.
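
To rule the module state in or out on later test runs, one could record 
whether drbd is loaded before starting pacemaker. The following is only a 
sketch; the function name module_loaded and its optional second argument 
(an alternate file to read, handy for testing) are made up for 
illustration.

```shell
#!/bin/sh
# Succeed (exit 0) if the named kernel module is listed in
# /proc/modules, fail otherwise. An optional second argument names a
# different file in the same format (hypothetical, for testing).
module_loaded() {
    # The first whitespace-separated field of each /proc/modules line is
    # the module name; require an exact match on it.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}
```

Running 'module_loaded drbd && echo loaded' just before starting pacemaker 
on each node would show whether the clean starts line up with the module 
already being present.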

Will keep testing... Any insight is much appreciated.
-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?