[Pacemaker] Pacemaker mount failures
Dejan Muhamedagic
dejanmm at fastmail.fm
Mon Jun 2 10:55:15 UTC 2014
Hi,
On Fri, May 30, 2014 at 12:17:00PM +0100, Stuart Taylor wrote:
> Hi
>
> I wonder if anyone on the list can help me - I’m new to Pacemaker so apologies if I’m posting in the wrong place.
>
> I have a four-node cluster running Pacemaker 1.1.10 with Corosync 1.4.1 on CentOS 6.4. Resource-wise I have eight Lustre storage targets on an iSCSI SAN - two each colocated with a single heartbeat IP address on each node. I have redundant Corosync rings and Stonith is configured, and failover in general works very well.
>
> My problem is that three of the storage targets refuse to mount via Pacemaker on particular nodes, for no reason I can identify. These resources won’t start on the nodes that the constraints assign them to - which is fine if all nodes are up, but not if certain nodes fail.
>
> If I stop the resources I can manually mount the targets on the node without any problem - so it seems to be a Pacemaker, rather than filesystem problem.
>
> My resources look like this: http://pastebin.com/qQ1BR1yW and constraints like this: http://pastebin.com/4w85MWUV
>
> crm_mon -f gives the following output:
>
> Last updated: Fri May 30 12:02:59 2014
> Last change: Fri May 30 12:02:38 2014 via crm_resource on oss-02
> Stack: classic openais (with plugin)
> Current DC: oss-02 - partition with quorum
> Version: 1.1.10-14.el6_5.3-368c726
> 4 Nodes configured, 4 expected votes
> 16 Resources configured
>
>
> Online: [ oss-01 oss-02 oss-03 oss-04 ]
>
> ost-01 (ocf::heartbeat:Filesystem): Started oss-01
> ost-02 (ocf::heartbeat:Filesystem): Started oss-02
> stonith-oss-01 (stonith:fence_ipmilan): Started oss-03
> stonith-oss-02 (stonith:fence_ipmilan): Started oss-04
> ost-03 (ocf::heartbeat:Filesystem): Started oss-04
> stonith-oss-03 (stonith:fence_ipmilan): Started oss-01
> ost-05 (ocf::heartbeat:Filesystem): Started oss-01
> ost-06 (ocf::heartbeat:Filesystem): Started oss-02
> ost-07 (ocf::heartbeat:Filesystem): Started oss-04
> ost-04 (ocf::heartbeat:Filesystem): Started oss-03
> ost-08 (ocf::heartbeat:Filesystem): Started oss-03
> oss-01-hb (ocf::heartbeat:IPaddr2): Started oss-01
> oss-02-hb (ocf::heartbeat:IPaddr2): Started oss-02
> oss-03-hb (ocf::heartbeat:IPaddr2): Started oss-04
> oss-04-hb (ocf::heartbeat:IPaddr2): Started oss-03
> stonith-oss-04 (stonith:fence_ipmilan): Started oss-02
>
> Migration summary:
> * Node oss-01:
> * Node oss-02:
> * Node oss-04:
>    ost-04: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 11:25:11 2014'
>    ost-08: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 11:25:11 2014'
> * Node oss-03:
>    ost-03: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 10:47:02 2014'
>
> ost-03 is supposed to mount on oss-03, and ost-04 & ost-08 on oss-04, but they fail to do so and the colo-ed IP resources are therefore swapped between oss-03 and oss-04.
>
> Log entries typically look like this, which doesn’t give me much to go on:
>
> May 30 11:25:11 oss-04 lrmd[2179]: notice: operation_finished: ost-08_start_0:2994:stderr [ mount.lustre: mount /dev/sdi at /lustre/ost-08 failed: Unknown error 524 ]
The mount command evidently failed, so something must differ
between mounting the filesystem by hand and the way the
Filesystem RA mounts it. I don't know offhand what error 524
means.
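As a side note on decoding the number: 524 is not a value that the
userspace strerror() knows about, which is why mount.lustre prints
"Unknown error". In the Linux kernel sources, include/linux/errno.h
defines 524 as ENOTSUPP ("Operation is not supported"). A minimal
sketch for looking up such a code yourself, assuming kernel sources or
headers are installed; the function name errno_name and the header path
in the usage comment are illustrative assumptions, not part of any
tool:

```shell
# errno_name NUMBER HEADER
# Hypothetical helper: print the macro name(s) that HEADER defines as NUMBER.
errno_name() {
    # Match lines of the form "#define NAME NUMBER ..." and print NAME.
    awk -v n="$1" '$1 == "#define" && $3 == n { print $2 }' "$2"
}

# Usage (the path is an assumption; it depends on where kernel sources live):
#   errno_name 524 /usr/src/kernels/$(uname -r)/include/linux/errno.h
```

That only names the error; it does not by itself say why the mount
request was rejected as unsupported.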
> Does anyone know / can anyone suggest how I might debug why Pacemaker can’t mount these targets?
Assuming you have recent enough resource-agents and crmsh, you
can trace the Filesystem RA, say:
# crm resource trace ost-08 start
Then clear the failure, which should make Pacemaker try to start
ost-08 again:
# crm resource cleanup ost-08
Then look for the trace file in /var/lib/heartbeat/trace_ra.
Alternatively, you can add 'set -x' somewhere in the Filesystem
RA, then look at the logs.
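To pick out the newest trace afterwards, something along these lines
may help. This is only a sketch: the helper name latest_trace is made
up, and it assumes the trace files are named after the resource (file
names beginning with "ost-08." somewhere under the trace directory),
which may differ between resource-agents versions.

```shell
# latest_trace RESOURCE [TRACE_DIR]
# Hypothetical helper: print the most recently modified trace file whose
# name starts with "RESOURCE.". TRACE_DIR defaults to the directory
# mentioned above.
latest_trace() {
    res="$1"
    dir="${2:-/var/lib/heartbeat/trace_ra}"
    # ls -t sorts newest first; head keeps only the newest match.
    find "$dir" -type f -name "${res}.*" -exec ls -t {} + 2>/dev/null | head -n 1
}

# Usage, after the trace and cleanup commands above have been run:
#   grep -n 'mount' "$(latest_trace ost-08)"
```

The trace is a 'set -x' style log, so the exact mount command line the
RA ran should be visible there and can be compared with the one you
type by hand.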
Thanks,
Dejan
>
> Many thanks
> Stuart
>
> Stuart Taylor
> System Administrator
> Edinburgh Genomics
>
> Web: http://genomics.ed.ac.uk/
> Tel: 0131 651 7403
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.