[Pacemaker] Trying to figure out a constraint

Digimer lists at alteeve.ca
Thu Jun 19 05:42:39 CEST 2014


On 18/06/14 12:47 AM, Andrew Beekhof wrote:
>
> On 18 Jun 2014, at 2:03 pm, Digimer <lists at alteeve.ca> wrote:
>
>> Hi all,
>>
>>   I am trying to set up a basic pacemaker 1.1.10 cluster on RHEL 6.5 with DRBD 8.3.16.
>>
>>   I've set up DRBD and configured one clustered LVM volume group using that drbd resource as the PV. With DRBD configured alone, I can stop/start pacemaker repeatedly without issue. However, when I add the LVM VG using ocf:heartbeat:LVM and set up a constraint, subsequent restarts of pacemaker almost always end in a fence. I have to think, then, that I am messing up my constraints...
>
> find out who is calling stonith_admin:
>
> Jun 17 23:56:06 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
> Jun 17 23:56:07 an-a04n01 kernel: block drbd0: Handshake successful: Agreed network protocol version 97
> Jun 17 23:56:07 an-a04n01 stonith_admin[28637]:   notice: crm_log_args: Invoked: stonith_admin --fence an-a04n02.alteeve.ca
> Jun 17 23:56:07 an-a04n01 stonith-ng[28356]:   notice: handle_request: Client stonith_admin.28637.6ed13ba6 wants to fence (off) 'an-a04n02.alteeve.ca' with device '(any)'
>
> Double check fence_pcmk includes "--tag cman" as an argument to stonith_admin (since that will rule it out as a source).
> Could drbd be initiating it?
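
Worth noting: the crm_log_args line above shows stonith_admin invoked 
with only '--fence' and no '--tag cman', which already pointed away 
from cman. For anyone wanting to check the same thing, what fence_pcmk 
passes can be seen with (assuming the stock script location on RHEL 6):

grep -n 'stonith_admin' /usr/sbin/fence_pcmk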

Following up on #linux-ha discussion...

DRBD was triggering this. When I used 
/usr/lib/drbd/stonith_admin-fence-peer.sh, I saw both nodes sit at 
'WFConnection' before the fence fired, which normally tells me there is 
a network issue. However, starting DRBD manually never caused a 
problem, and after the fenced node came back up, starting pacemaker on 
it worked fine.
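
For reference, I was watching the connection state on each node with 
the standard drbd-utils tools while pacemaker started:

watch -n1 'cat /proc/drbd'

('drbdadm cstate r0' shows just the connection state for r0, if you 
don't want the whole status.)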

So I decided to try '/usr/lib/drbd/crm-fence-peer.sh' instead. Now, 
instead of fencing, an-a04n02 (node 2) fails to promote. However, if I 
try running:

pcs resource debug-start drbd_r0

I get:

Operation start for drbd_r0:0 (ocf:linbit:drbd) returned 0
  >  stdout:         allow-two-primaries;
  >  stdout:
  >  stdout:
  >  stderr: WARNING: You may be disappointed: This RA is intended for pacemaker 1.0 or better!
  >  stderr: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf adjust r0
  >  stderr: DEBUG: r0: Exit code 0
  >  stderr: DEBUG: r0: Command output:
  >  stderr: DEBUG: r0: Calling /usr/sbin/crm_master -Q -l reboot -v 10000
  >  stderr: DEBUG: r0: Exit code 0
  >  stderr: DEBUG: r0: Command output:

So the agent itself seems to think the start worked: drbdadm adjust 
exits cleanly and crm_master sets a master score of 10000. However, 
the cluster is left at:

Cluster name: an-anvil-04
Last updated: Wed Jun 18 23:37:43 2014
Last change: Wed Jun 18 23:19:24 2014 via cibadmin on an-a04n02.alteeve.ca
Stack: cman
Current DC: an-a04n01.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
4 Resources configured


Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]

Full list of resources:

  fence_n01_ipmi	(stonith:fence_ipmilan):	Started an-a04n01.alteeve.ca
  fence_n02_ipmi	(stonith:fence_ipmilan):	Started an-a04n02.alteeve.ca
  Master/Slave Set: drbd_r0_Clone [drbd_r0]
      Masters: [ an-a04n01.alteeve.ca ]
      Slaves: [ an-a04n02.alteeve.ca ]
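
To rule out scoring, the promotion scores can be dumped from the live 
CIB; something like:

crm_simulate -sL | grep -i promotion

shows the master score each node holds for drbd_r0. The debug-start 
above set 10000 via crm_master, so the refusal to promote has to be 
coming from somewhere else.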

Then, when I check the constraints, I see that this has been created:

Location Constraints:
   Resource: drbd_r0_Clone
     Constraint: drbd-fence-by-handler-r0-drbd_r0_Clone
       Rule: score=-INFINITY role=Master (id:drbd-fence-by-handler-r0-rule-drbd_r0_Clone)
         Expression: #uname ne an-a04n01.alteeve.ca (id:drbd-fence-by-handler-r0-expr-drbd_r0_Clone)
Ordering Constraints:
Colocation Constraints:

If I delete the constraint, node 2 promotes properly right away.
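
(For reference, "deleting the constraint" was just removing it by the 
id shown above:

pcs constraint remove drbd-fence-by-handler-r0-drbd_r0_Clone

)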

So I have to conclude that, for some reason, the way I am using 
ocf:linbit:drbd is wrong, or I've not configured 
'/etc/drbd.d/global_common.conf' properly.
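
For completeness, the pacemaker side is essentially this (a sketch 
reconstructed from memory, not verbatim; the exact CIB is in the 
attached crm_report):

pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
pcs resource master drbd_r0_Clone drbd_r0 \
    master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 \
    notify=true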

As for the DRBD configuration, here is what I have:

# /etc/drbd.conf
common {
     protocol               C;
     net {
         allow-two-primaries;
         after-sb-0pri    discard-zero-changes;
         after-sb-1pri    discard-secondary;
         after-sb-2pri    disconnect;
     }
     disk {
         fencing          resource-and-stonith;
     }
     syncer {
         rate             40M;
     }
     handlers {
         fence-peer       /usr/lib/drbd/crm-fence-peer.sh;
     }
}

# resource r0 on an-a04n01.alteeve.ca: not ignored, not stacked
resource r0 {
     on an-a04n01.alteeve.ca {
         device           /dev/drbd0 minor 0;
         disk             /dev/sda5;
         address          ipv4 10.10.40.1:7788;
         meta-disk        internal;
     }
     on an-a04n02.alteeve.ca {
         device           /dev/drbd0 minor 0;
         disk             /dev/sda5;
         address          ipv4 10.10.40.2:7788;
         meta-disk        internal;
     }
}

# resource r1 on an-a04n01.alteeve.ca: not ignored, not stacked
resource r1 {
     on an-a04n01.alteeve.ca {
         device           /dev/drbd1 minor 1;
         disk             /dev/sda6;
         address          ipv4 10.10.40.1:7789;
         meta-disk        internal;
     }
     on an-a04n02.alteeve.ca {
         device           /dev/drbd1 minor 1;
         disk             /dev/sda6;
         address          ipv4 10.10.40.2:7789;
         meta-disk        internal;
     }
}
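
One thing I notice while pasting this: the DRBD user's guide pairs the 
crm-fence-peer.sh fence-peer handler with an unfence handler that 
removes the constraint again once resync completes. I don't have that 
here, which may be part of why the constraint lingers. Presumably the 
handlers section would become (going by the docs, untested here):

     handlers {
         fence-peer             /usr/lib/drbd/crm-fence-peer.sh;
         after-resync-target    /usr/lib/drbd/crm-unfence-peer.sh;
     }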

Note that, for the time being, I've not configured r1 in pacemaker, to 
keep the config simple while debugging.

Attached is the crm_report, hopefully it might shed some light on what I 
am doing wrong.

Thanks!

digimer


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: an-anvil-04_2014-06-18_01.tar.bz2
Type: application/x-bzip
Size: 180922 bytes
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20140618/70265908/attachment-0001.bin>

