[Pacemaker] Odd problem with crm/Filesystem RA
Rob Thomas
xrobau at gmail.com
Mon Jun 20 05:45:13 UTC 2011
Executive overview: When bringing a node back from standby to test
failover, the Filesystem RA on the _slave_ node, which has just
relinquished the resource, tries to mount the filesystem again after
it has handed it back to the master, fails, and leaves the cluster
with a failure event that won't let the filesystem be remounted there
until it's cleared with
crm_resource --resource fs_asterisk -C
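(For what it's worth, I believe the crm shell equivalent of that
cleanup is simply:

# crm resource cleanup fs_asterisk

which clears the fail record for fs_asterisk so it can be started on
the slave again.)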
So, background: SL6 x86_64, drbd 8.4.0rc2, pacemaker-1.1.2-7.el6,
resource-agents-3.0.12-15.el6
I've added some debugging to
/usr/lib/ocf/resource.d/heartbeat/Filesystem with a nice unique XYZZY
tag when Filesystem_start or Filesystem_stop is called.
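The debug lines themselves are nothing fancy - roughly the following,
added at the top of those two functions (exact wording may differ a
little; $DEVICE is the device variable the RA already uses):

ocf_log warn "XYZZY: Starting Filesystem $DEVICE"   # top of Filesystem_start()
ocf_log warn "XYZZY: Stopping Filesystem $DEVICE"   # top of Filesystem_stop()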
# crm status (other resources snipped for readability)
============
Last updated: Mon Jun 20 15:02:05 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============
Online: [ master slave ]
 Master/Slave Set: ms_drbd_asterisk
     Masters: [ master ]
     Slaves: [ slave ]
 Resource Group: asterisk
     fs_asterisk    (ocf::heartbeat:Filesystem):    Started master
     ip_asterisk    (ocf::heartbeat:IPaddr2):       Started master
[...snip...]
# echo > /var/log/messages
# crm node standby master
# crm status
============
Last updated: Mon Jun 20 15:07:52 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============
Node master: standby
Online: [ slave ]
 Master/Slave Set: ms_drbd_asterisk
     Masters: [ slave ]
     Stopped: [ drbd_asterisk:1 ]
 Resource Group: asterisk
     fs_asterisk    (ocf::heartbeat:Filesystem):    Started slave
     ip_asterisk    (ocf::heartbeat:IPaddr2):       Started slave
(still no errors, all good)
# grep XYZZY /var/log/messages
Jun 20 15:07:37 localhost Filesystem[9879]: WARNING: XYZZY: Starting
Filesystem /dev/drbd1
Now, if I bring 'master' back online (and I've stickied that resource
to master, so it moves back straight away):
# echo > /var/log/messages
# crm node online master
# crm status
============
Last updated: Mon Jun 20 15:11:29 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============
Online: [ master slave ]
 Master/Slave Set: ms_drbd_asterisk
     Masters: [ master ]
     Slaves: [ slave ]
 Resource Group: asterisk
     fs_asterisk    (ocf::heartbeat:Filesystem):    Started master
     ip_asterisk    (ocf::heartbeat:IPaddr2):       Started master

Failed actions:
    fs_asterisk_start_0 (node=slave, call=809, rc=1, status=complete): unknown error
And grepping produces:
# grep XYZZY /var/log/messages
Jun 20 15:10:51 localhost Filesystem[13889]: WARNING: XYZZY: Stopping
Filesystem /dev/drbd1
Jun 20 15:10:56 localhost Filesystem[15338]: WARNING: XYZZY: Starting
Filesystem /dev/drbd1
Jun 20 15:10:58 localhost Filesystem[15593]: WARNING: XYZZY: Stopping
Filesystem /dev/drbd1
Oddly enough, it's actually _the same 3_ of the 5 groups doing this
each time.
Complete 'crm configure show' output here: http://pastebin.com/0S7UVZ5U
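Regarding the 'stickied to master' bit above: that's just a location
preference for the asterisk group, roughly along these lines
(illustrative only - the real IDs and scores are in the pastebin):

location loc_asterisk_on_master asterisk 100: master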
(Note that I've also tried it with a longer monitor interval - 59s -
and without the --target-role=started on a couple of the resources.)
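To be clear, by that I mean tweaking the op/meta lines on the
primitive, along these lines (the device is the one from the logs
above, but the directory and fstype here are placeholders rather than
my real values):

primitive fs_asterisk ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/some/mountpoint" fstype="ext4" \
        op monitor interval="59s" \
        meta target-role="Started"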
Doing a 'crm status' whilst it's failing over, you see it say 'FAILED:
slave' before it starts up on the master.
If there's any useful information I can provide from /var/log/messages
I'd be happy to send it along, but I just need to know which parts you
consider important, as there's tonnes of it!
--Rob