[Pacemaker] Problem with failover/failback under Ubuntu 10.04 for Active/Passive OpenNMS
Sven Wick
sven.wick at gmx.de
Mon Jul 5 10:01:17 UTC 2010
Hi,
I am trying to build an Active/Passive OpenNMS installation
but am having problems with failover and failback.
Doing everything manually (DRBD, VIP, filesystem, ...) works just fine.
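By "manually" I mean roughly this sequence on one node
(resource and device names as in my config below):
drbdadm primary config && mount /dev/drbd/by-res/config /etc/opennms
drbdadm primary db && mount /dev/drbd/by-res/db /var/lib/postgresql
drbdadm primary data && mount /dev/drbd/by-res/data /var/lib/opennms
ip addr add 172.24.25.20/24 dev bond0
/etc/init.d/postgresql-8.4 start && /etc/init.d/opennms start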
I have read many tutorials, the official book, and the FAQ,
but I am still stuck.
I use Ubuntu 10.04 with "pacemaker" and "corosync".
The hosts are two HP ProLiant G6 servers connected through a Cisco switch.
The switch was configured to allow multicast as described in [1].
The hostnames are "monitoring-node-01" and "monitoring-node-02",
and it seems I can fail over to "monitoring-node-02" but not back.
The DRBD init script is disabled on both nodes.
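(On Ubuntu that is essentially:
update-rc.d -f drbd remove
so that only Pacemaker starts and stops DRBD.)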
First, my current config:
crm configure show
node monitoring-node-01 \
attributes standby="off"
node monitoring-node-02 \
attributes standby="off"
primitive drbd-opennms-config ocf:linbit:drbd \
params drbd_resource="config" \
op start interval="0" timeout="300s" \
op stop interval="0" timeout="300s"
primitive drbd-opennms-data ocf:linbit:drbd \
params drbd_resource="data" \
op start interval="0" timeout="300s" \
op stop interval="0" timeout="300s"
primitive drbd-opennms-db ocf:linbit:drbd \
params drbd_resource="db" \
op start interval="0" timeout="300s" \
op stop interval="0" timeout="300s"
primitive fs-opennms-config ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/config" directory="/etc/opennms" fstype="xfs" \
op stop interval="0" timeout="300s" \
op start interval="0" timeout="300s"
primitive fs-opennms-data ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/data" directory="/var/lib/opennms" fstype="xfs" \
op stop interval="0" timeout="300s" \
op start interval="0" timeout="300s"
primitive fs-opennms-db ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/db" directory="/var/lib/postgresql" fstype="xfs" \
op stop interval="0" timeout="300s" \
op start interval="0" timeout="300s"
primitive opennms lsb:opennms \
op start interval="0" timeout="300s" \
op stop interval="0" timeout="300s" \
meta target-role="Started"
primitive postgres lsb:postgresql-8.4
primitive vip ocf:heartbeat:IPaddr2 \
params ip="172.24.25.20" cidr_netmask="24" nic="bond0"
group dependencies vip fs-opennms-config fs-opennms-db fs-opennms-data postgres opennms
ms ms-opennms-config drbd-opennms-config \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms-opennms-data drbd-opennms-data \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms-opennms-db drbd-opennms-db \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
order promote-before-drbd-config inf: ms-opennms-config:promote fs-opennms-config:start
order promote-before-drbd-data inf: ms-opennms-data:promote fs-opennms-data:start
order promote-before-drbd-db inf: ms-opennms-db:promote fs-opennms-db:start
property $id="cib-bootstrap-options" \
dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1278329430"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
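One thing I am not sure about: I only have order constraints, no
colocation constraints tying the filesystems to the DRBD Masters.
I guess the group should also be colocated with the Master role,
something like this (untested, names taken from my config above):
colocation fs-on-drbd-config inf: dependencies ms-opennms-config:Master
colocation fs-on-drbd-data inf: dependencies ms-opennms-data:Master
colocation fs-on-drbd-db inf: dependencies ms-opennms-db:Master
Could the missing colocation be the reason for my failback problem?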
Current state when everything works:
On monitoring-node-01:
cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@monitoring-node-01, 2010-06-22 20:00:41
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
ns:0 nr:0 dw:2052 dr:29542 al:2 bm:8 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
ns:0 nr:0 dw:0 dr:200 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r----
ns:0 nr:0 dw:0 dr:200 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p1 270G 3.1G 253G 2% /
none 16G 212K 16G 1% /dev
none 16G 3.5M 16G 1% /dev/shm
none 16G 84K 16G 1% /var/run
none 16G 12K 16G 1% /var/lock
none 16G 0 16G 0% /lib/init/rw
/dev/drbd0 10G 34M 10G 1% /etc/opennms
/dev/drbd1 50G 67M 50G 1% /var/lib/postgresql
/dev/drbd2 214G 178M 214G 1% /var/lib/opennms
crm_mon
============
Last updated: Mon Jul 5 13:02:54 2010
Stack: openais
Current DC: monitoring-node-02 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ monitoring-node-01 monitoring-node-02 ]
Master/Slave Set: ms-opennms-data
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-db
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-config
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Resource Group: dependencies
vip (ocf::heartbeat:IPaddr2): Started monitoring-node-01
fs-opennms-config (ocf::heartbeat:Filesystem): Started monitoring-node-01
fs-opennms-db (ocf::heartbeat:Filesystem): Started monitoring-node-01
fs-opennms-data (ocf::heartbeat:Filesystem): Started monitoring-node-01
postgres (lsb:postgresql-8.4): Started monitoring-node-01
opennms (lsb:opennms): Started monitoring-node-01
On monitoring-node-02:
cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@monitoring-node-02, 2010-06-22 19:59:35
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
ns:0 nr:0 dw:24620 dr:400 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
ns:0 nr:302898 dw:302898 dr:400 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r----
ns:0 nr:19712467 dw:19712467 dr:400 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p1 270G 1.8G 254G 1% /
none 16G 212K 16G 1% /dev
none 16G 3.6M 16G 1% /dev/shm
none 16G 72K 16G 1% /var/run
none 16G 12K 16G 1% /var/lock
none 16G 0 16G 0% /lib/init/rw
Rebooting "monitoring-node01" seems to work. All services run now
on "monitoring-node02" and were started in correct order.
But when I reboot "monitoring-node-02":
crm_mon
============
Last updated: Mon Jul 5 13:21:31 2010
Stack: openais
Current DC: monitoring-node-01 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ monitoring-node-01 monitoring-node-02 ]
Master/Slave Set: ms-opennms-data
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-db
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-config
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Resource Group: dependencies
vip (ocf::heartbeat:IPaddr2): Started monitoring-node-02
fs-opennms-config (ocf::heartbeat:Filesystem): Stopped
fs-opennms-db (ocf::heartbeat:Filesystem): Stopped
fs-opennms-data (ocf::heartbeat:Filesystem): Stopped
postgres (lsb:postgresql-8.4): Stopped
opennms (lsb:opennms): Stopped
Failed actions:
fs-opennms-config_start_0 (node=monitoring-node-02, call=18, rc=1, status=complete): unknown error
When I then run the following command:
crm resource cleanup fs-opennms-config
the failover goes through and the services are started on monitoring-node-01:
crm_mon
============
Last updated: Mon Jul 5 13:31:31 2010
Stack: openais
Current DC: monitoring-node-01 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ monitoring-node-01 monitoring-node-02 ]
Master/Slave Set: ms-opennms-data
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-db
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-config
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Resource Group: dependencies
vip (ocf::heartbeat:IPaddr2): Started monitoring-node-01
fs-opennms-config (ocf::heartbeat:Filesystem): Started monitoring-node-01
fs-opennms-db (ocf::heartbeat:Filesystem): Started monitoring-node-01
fs-opennms-data (ocf::heartbeat:Filesystem): Started monitoring-node-01
postgres (lsb:postgresql-8.4): Started monitoring-node-01
opennms (lsb:opennms): Started monitoring-node-01
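I assume the failed start leaves a fail count behind that blocks the
resource until the cleanup resets it; presumably I could also inspect
it with something like:
crm_mon -1 -f
crm resource failcount fs-opennms-config show monitoring-node-02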
There are some entries in my daemon.log [2] which look
as if they have something to do with my problem...
monitoring-node-01 lrmd: [994]: info: rsc:drbd-opennms-data:1:30: promote
monitoring-node-01 crmd: [998]: info: do_lrm_rsc_op: Performing key=93:25:0:d49b62be-1e33-48ca-a8c3-cb128676d444 op=fs-opennms-config_start_0 )
monitoring-node-01 lrmd: [994]: info: rsc:fs-opennms-config:31: start
monitoring-node-01 Filesystem[2464]: INFO: Running start for /dev/drbd/by-res/config on /etc/opennms
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) FATAL: Module scsi_hostadapter not found.
monitoring-node-01 lrmd: [994]: info: RA output: (drbd-opennms-data:1:promote:stdout)
monitoring-node-01 crmd: [998]: info: process_lrm_event: LRM operation drbd-opennms-data:1_promote_0 (call=30, rc=0, cib-update=35, confirmed=true) ok
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) /dev/drbd/by-res/config: Wrong medium type
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) mount: block device /dev/drbd0 is write-protected, mounting read-only
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) mount: Wrong medium type
monitoring-node-01 Filesystem[2464]: ERROR: Couldn't mount filesystem /dev/drbd/by-res/config on /etc/opennms
monitoring-node-01 crmd: [998]: info: process_lrm_event: LRM operation fs-opennms-config_start_0 (call=31, rc=1, cib-update=36, confirmed=true) unknown error
...but I don't know how to troubleshoot it further. To me, the
"Wrong medium type" and "write-protected" messages look as if the
filesystem start is attempted while the DRBD device is still in the
Secondary role.
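If that guess is right, I assume I could verify it by checking the
DRBD role on the node right after the failed start, e.g.:
cat /proc/drbd
drbdadm role config   # "config" is the resource name from my drbd.conf
which should then show Secondary at the time of the failed mount.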
[1] http://www.corosync.org/doku.php?id=faq:cisco_switches
[2] http://pastebin.com/DKLjXtx8