[Pacemaker] Collocating resource with a started clone instance

Sergey Tachenov stachenov at gmail.com
Fri Jun 22 05:58:55 EDT 2012


Hi!

I'm trying to set up a 2-node cluster. I'm new to Pacemaker, and
things have been getting better and better. Here, however, I'm
completely at a loss.

I have a cloned tomcat resource, which runs on both nodes and doesn't
really depend on anything (it doesn't use DRBD or anything else of
that sort). But I'm trying to get Pacemaker to move the cluster IP to
the other node in case tomcat fails. Here are the relevant parts of my
config:

node srvplan1
node srvplan2
primitive DBIP ocf:heartbeat:IPaddr2 \
       params ip="1.2.3.4" cidr_netmask="24" \
       op monitor interval="10s"
primitive drbd_pgdrive ocf:linbit:drbd \
       params drbd_resource="pgdrive" \
       op start interval="0" timeout="240" \
       op stop interval="0" timeout="100"
primitive pgdrive_fs ocf:heartbeat:Filesystem \
       params device="/dev/drbd0" directory="/hd2" fstype="ext4"
primitive ping ocf:pacemaker:ping \
       params host_list="193.233.59.2" multiplier="1000" \
       op monitor interval="10"
primitive postgresql ocf:heartbeat:pgsql \
       params pgdata="/hd2/pgsql" \
       op monitor interval="30" timeout="30" depth="0" \
       op start interval="0" timeout="60" \
       op stop interval="0" timeout="60" \
       meta target-role="Started"
primitive tomcat ocf:heartbeat:tomcat \
       params java_home="/usr/lib/jvm/jre" \
              catalina_home="/usr/share/tomcat" tomcat_user="tomcat" \
              script_log="/home/tmo/log/tomcat.log" \
              statusurl="http://127.0.0.1:8080/status/" \
       op start interval="0" timeout="60" \
       op stop interval="0" timeout="120" \
       op monitor interval="30" timeout="30"
group postgres pgdrive_fs DBIP postgresql
ms ms_drbd_pgdrive drbd_pgdrive \
       meta master-max="1" master-node-max="1" clone-max="2" \
            clone-node-max="1" notify="true"
clone pings ping \
       meta interleave="true"
clone tomcats tomcat \
       meta interleave="true" target-role="Started"
location DBIPcheck DBIP \
       rule $id="DBIPcheck-rule" 10000: defined pingd and pingd gt 0
location master-prefer-node1 DBIP 50: srvplan1
colocation DBIP-on-web 1000: DBIP tomcats
colocation postgres_on_drbd inf: postgres ms_drbd_pgdrive:Master
order postgres_after_drbd inf: ms_drbd_pgdrive:promote postgres:start

As you can see, there are three explicit constraints on the DBIP
resource: a preferred node (srvplan1, score 50), a successful ping
(score 10000) and a running tomcat (score 1000). Resource stickiness
is also set to 100. On top of that, DBIP is implicitly colocated with
the DRBD Master instance, because it is part of the postgres group.
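
The stickiness isn't in the snippet above; it is set cluster-wide,
roughly like this (I'm quoting from memory, so the exact form may
differ):

rsc_defaults $id="rsc-options" \
       resource-stickiness="100"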

The ping check works fine: if I unplug the external LAN cable or use
iptables to block pings, everything gets moved to another node.
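
For the record, blocking the pings was done with something along these
lines (the exact rule isn't important; the point is that the ping
resource stops getting replies from 193.233.59.2):

iptables -A OUTPUT -p icmp -d 193.233.59.2 -j DROP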

The check for tomcat isn't working for some reason, though:

[root@srvplan1 bin]# crm_mon -1
============
Last updated: Fri Jun 22 10:06:59 2012
Last change: Fri Jun 22 09:43:16 2012 via cibadmin on srvplan1
Stack: openais
Current DC: srvplan1 - partition with quorum
Version: 1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
17 Resources configured.
============

Online: [ srvplan1 srvplan2 ]

 Master/Slave Set: ms_drbd_pgdrive [drbd_pgdrive]
    Masters: [ srvplan1 ]
    Slaves: [ srvplan2 ]
 Resource Group: postgres
    pgdrive_fs (ocf::heartbeat:Filesystem):    Started srvplan1
    DBIP       (ocf::heartbeat:IPaddr2):       Started srvplan1
    postgresql (ocf::heartbeat:pgsql): Started srvplan1
 Clone Set: pings [ping]
    Started: [ srvplan1 srvplan2 ]
 Clone Set: tomcats [tomcat]
    Started: [ srvplan2 ]
    Stopped: [ tomcat:0 ]

Failed actions:
   tomcat:0_start_0 (node=srvplan1, call=37, rc=-2, status=Timed Out): unknown exec error

As you can see, tomcat is stopped on srvplan1 (I have deliberately
messed up the startup scripts), but everything else still runs there.
ptest -L -s shows:

clone_color: ms_drbd_pgdrive allocation score on srvplan1: 10350
clone_color: ms_drbd_pgdrive allocation score on srvplan2: 10000
clone_color: drbd_pgdrive:0 allocation score on srvplan1: 10100
clone_color: drbd_pgdrive:0 allocation score on srvplan2: 0
clone_color: drbd_pgdrive:1 allocation score on srvplan1: 0
clone_color: drbd_pgdrive:1 allocation score on srvplan2: 10100
native_color: drbd_pgdrive:0 allocation score on srvplan1: 10100
native_color: drbd_pgdrive:0 allocation score on srvplan2: 0
native_color: drbd_pgdrive:1 allocation score on srvplan1: -INFINITY
native_color: drbd_pgdrive:1 allocation score on srvplan2: 10100
drbd_pgdrive:0 promotion score on srvplan1: 30700
drbd_pgdrive:1 promotion score on srvplan2: 30000
group_color: postgres allocation score on srvplan1: 0
group_color: postgres allocation score on srvplan2: 0
group_color: pgdrive_fs allocation score on srvplan1: 100
group_color: pgdrive_fs allocation score on srvplan2: 0
group_color: DBIP allocation score on srvplan1: 10150
group_color: DBIP allocation score on srvplan2: 10000
group_color: postgresql allocation score on srvplan1: 100
group_color: postgresql allocation score on srvplan2: 0
native_color: pgdrive_fs allocation score on srvplan1: 20450
native_color: pgdrive_fs allocation score on srvplan2: -INFINITY
clone_color: tomcats allocation score on srvplan1: -INFINITY
clone_color: tomcats allocation score on srvplan2: 0
clone_color: tomcat:0 allocation score on srvplan1: -INFINITY
clone_color: tomcat:0 allocation score on srvplan2: 0
clone_color: tomcat:1 allocation score on srvplan1: -INFINITY
clone_color: tomcat:1 allocation score on srvplan2: 100
native_color: tomcat:1 allocation score on srvplan1: -INFINITY
native_color: tomcat:1 allocation score on srvplan2: 100
native_color: tomcat:0 allocation score on srvplan1: -INFINITY
native_color: tomcat:0 allocation score on srvplan2: -INFINITY
native_color: DBIP allocation score on srvplan1: 9250
native_color: DBIP allocation score on srvplan2: -INFINITY
native_color: postgresql allocation score on srvplan1: 100
native_color: postgresql allocation score on srvplan2: -INFINITY
clone_color: pings allocation score on srvplan1: 0
clone_color: pings allocation score on srvplan2: 0
clone_color: ping:0 allocation score on srvplan1: 100
clone_color: ping:0 allocation score on srvplan2: 0
clone_color: ping:1 allocation score on srvplan1: 0
clone_color: ping:1 allocation score on srvplan2: 100
native_color: ping:0 allocation score on srvplan1: 100
native_color: ping:0 allocation score on srvplan2: 0
native_color: ping:1 allocation score on srvplan1: -INFINITY
native_color: ping:1 allocation score on srvplan2: 100

Why is the score for DBIP -INFINITY on srvplan2? The only inf rule in
my config is the colocation rule for the postgres group. I could
surmise that DBIP can't run on srvplan2 because DRBD isn't Master
there, but nothing prevents it from being promoted there, and this
rule doesn't stop DBIP from being moved in case of ping failure
either. So there must be something else.
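
For reference, the only inf constraint, together with the group that
DBIP belongs to, is just these two lines from the config above:

group postgres pgdrive_fs DBIP postgresql
colocation postgres_on_drbd inf: postgres ms_drbd_pgdrive:Master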

I also don't quite understand why the DBIP score is 9250 on srvplan1.
It should be at least 10000 for the ping, plus another 250 for
preference and stickiness. If I migrate DBIP to srvplan2 manually, its
score there is 10200, which makes me think that 1000 gets subtracted
on srvplan1 because tomcat is stopped there. But why? This is a
positive rule, not a negative one. It should add 1000 if tomcat is
running, but it shouldn't subtract anything if it isn't, or am I
wrong?
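
To spell out my arithmetic (the breakdown is only my guess, of
course):

   expected:  10000 (ping) + 250 (preference + stickiness)        = 10250
   observed:  10000 (ping) + 250 (preference + stickiness) - 1000 =  9250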

Does this have anything to do with the fact that I'm trying to
colocate the IP with a clone? Or am I looking in the wrong direction?

I tried removing DBIP from the group, and it got moved to the other
node. Obviously, everything else was left on the first one. Then I
tried adding a colocation of DBIP with the postgres resources (and the
other way around; see the sketch below), and if the score of that rule
is high enough, the IP gets moved back, but I was never able to get
postgres moved to the second node (where the IP is) instead.
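
Roughly, these are the kinds of constraints I experimented with (the
names and scores here are illustrative, not the exact ones I used):

colocation DBIP-with-postgres 2000: DBIP postgres
colocation postgres-with-DBIP 2000: postgres DBIP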

-- 
Sergey A. Tachenov <stachenov at gmail.com>



