[Pacemaker] Pacemaker mount failures

Stuart Taylor s.j.taylor at ed.ac.uk
Fri May 30 07:17:00 EDT 2014


Hi

I wonder if anyone on the list can help me - I'm new to Pacemaker, so apologies if I'm posting in the wrong place.

I have a four-node cluster running Pacemaker 1.1.10 with Corosync 1.4.1 on CentOS 6.4. Resource-wise, I have eight Lustre storage targets on an iSCSI SAN: two per node, each pair colocated with that node's heartbeat IP address. I have redundant Corosync rings, STONITH is configured, and failover in general works very well.

My problem is that three of the storage targets refuse to mount via Pacemaker on particular nodes, for no reason I can identify. These resources won't start on the nodes my constraints prefer, which is fine while all nodes are up (they simply run elsewhere), but not if certain nodes fail.

If I stop the resources I can manually mount the targets on the node without any problem, so it seems to be a Pacemaker problem rather than a filesystem one.
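
For example, for ost-08 on oss-04 (the device and mount point are taken from the log excerpt further down), a plain mount succeeds by hand:

   mount -t lustre /dev/sdi /lustre/ost-08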

My resources look like this: http://pastebin.com/qQ1BR1yW and constraints like this: http://pastebin.com/4w85MWUV
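
In case the pastes expire: each target/constraint pair is roughly of the following shape in crm syntax (the scores, device path and monitor interval here are illustrative rather than my exact values):

   primitive ost-01 ocf:heartbeat:Filesystem \
       params device="/dev/sdX" directory="/lustre/ost-01" fstype="lustre" \
       op monitor interval="120s"
   colocation ost-01-with-ip inf: ost-01 oss-01-hb
   location ost-01-on-oss-01 ost-01 100: oss-01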

crm_mon -f gives the following output:

Last updated: Fri May 30 12:02:59 2014
Last change: Fri May 30 12:02:38 2014 via crm_resource on oss-02
Stack: classic openais (with plugin)
Current DC: oss-02 - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
4 Nodes configured, 4 expected votes
16 Resources configured


Online: [ oss-01 oss-02 oss-03 oss-04 ]

ost-01  (ocf::heartbeat:Filesystem):    Started oss-01
ost-02  (ocf::heartbeat:Filesystem):    Started oss-02
stonith-oss-01  (stonith:fence_ipmilan):        Started oss-03
stonith-oss-02  (stonith:fence_ipmilan):        Started oss-04
ost-03  (ocf::heartbeat:Filesystem):    Started oss-04
stonith-oss-03  (stonith:fence_ipmilan):        Started oss-01
ost-05  (ocf::heartbeat:Filesystem):    Started oss-01
ost-06  (ocf::heartbeat:Filesystem):    Started oss-02
ost-07  (ocf::heartbeat:Filesystem):    Started oss-04
ost-04  (ocf::heartbeat:Filesystem):    Started oss-03
ost-08  (ocf::heartbeat:Filesystem):    Started oss-03
oss-01-hb	(ocf::heartbeat:IPaddr2):	Started oss-01
oss-02-hb	(ocf::heartbeat:IPaddr2):	Started oss-02
oss-03-hb	(ocf::heartbeat:IPaddr2):	Started oss-04
oss-04-hb	(ocf::heartbeat:IPaddr2):	Started oss-03
stonith-oss-04  (stonith:fence_ipmilan):        Started oss-02

Migration summary:
* Node oss-01: 
* Node oss-02: 
* Node oss-04: 
   ost-04: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 11:25:11 2014'
   ost-08: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 11:25:11 2014'
* Node oss-03: 
   ost-03: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 10:47:02 2014'

ost-03 is supposed to mount on oss-03, and ost-04 and ost-08 on oss-04, but they fail to start there, so the colocated IP resources end up swapped between oss-03 and oss-04.
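
As I understand it, a fail-count of 1000000 is treated as INFINITY, so each failed start bans the resource from that node until the failure is cleared with something like:

   crm resource cleanup ost-03
   # or equivalently
   crm_resource --cleanup --resource ost-03 --node oss-03

Clearing the fail-counts just leads to the same failure on the next start attempt.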

Log entries typically look like this, which doesn’t give me much to go on:

May 30 11:25:11 oss-04 lrmd[2179]:   notice: operation_finished: ost-08_start_0:2994:stderr [ mount.lustre: mount /dev/sdi at /lustre/ost-08 failed: Unknown error 524 ]

Can anyone suggest how I might debug why Pacemaker fails to mount these targets?
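
Unless anyone has a better idea, my next step is to run the Filesystem resource agent by hand on oss-04 with shell tracing, outside Pacemaker, using the parameters from the log line above, in the hope of getting more detail than lrmd reports:

   export OCF_ROOT=/usr/lib/ocf
   export OCF_RESKEY_device=/dev/sdi
   export OCF_RESKEY_directory=/lustre/ost-08
   export OCF_RESKEY_fstype=lustre
   bash -x /usr/lib/ocf/resource.d/heartbeat/Filesystem start

(ocf-tester from the resource-agents package looks like another option, if that's the more sensible route.)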

Many thanks
Stuart

Stuart Taylor
System Administrator
Edinburgh Genomics

Web: http://genomics.ed.ac.uk/
Tel: 0131 651 7403


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.




