[Pacemaker] Odd problem with crm/Filesystem RA
Rob Thomas
xrobau at gmail.com
Mon Jun 20 05:45:13 UTC 2011
Executive overview: When bringing a node back from standby to test
failover, the Filesystem RA on the _slave_ node, which has just
relinquished the resource, tries to mount the filesystem again after
it has handed it back to the master, fails, and leaves the cluster
with a failure event that won't let the filesystem be remounted there
until it's cleared with
crm_resource --resource fs_asterisk -C
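(For what it's worth, I believe the crm shell equivalent of that
cleanup is simply:

# crm resource cleanup fs_asterisk

which clears the fail record for fs_asterisk so it can be started on
the slave again.)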
So, background: SL6 x86_64, drbd 8.4.0rc2, pacemaker-1.1.2-7.el6,
resource-agents-3.0.12-15.el6
I've added some debugging to
/usr/lib/ocf/resource.d/heartbeat/Filesystem with a nice unique XYZZY
tag when Filesystem_start or Filesystem_stop is called.
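The debug lines themselves are nothing fancy - roughly the following,
added at the top of those two functions (exact wording may differ a
little; $DEVICE is the device variable the RA already uses):

ocf_log warn "XYZZY: Starting Filesystem $DEVICE"   # top of Filesystem_start()
ocf_log warn "XYZZY: Stopping Filesystem $DEVICE"   # top of Filesystem_stop()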
# crm status (other resources snipped for readability)
============
Last updated: Mon Jun 20 15:02:05 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============
Online: [ master slave ]
 Master/Slave Set: ms_drbd_asterisk
     Masters: [ master ]
     Slaves: [ slave ]
 Resource Group: asterisk
     fs_asterisk    (ocf::heartbeat:Filesystem):    Started master
     ip_asterisk    (ocf::heartbeat:IPaddr2):       Started master
[...snip...]
# echo > /var/log/messages
# crm node standby master
# crm status
============
Last updated: Mon Jun 20 15:07:52 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============
Node master: standby
Online: [ slave ]
 Master/Slave Set: ms_drbd_asterisk
     Masters: [ slave ]
     Stopped: [ drbd_asterisk:1 ]
 Resource Group: asterisk
     fs_asterisk    (ocf::heartbeat:Filesystem):    Started slave
     ip_asterisk    (ocf::heartbeat:IPaddr2):       Started slave
(still no errors, all good)
# grep XYZZY /var/log/messages
Jun 20 15:07:37 localhost Filesystem[9879]: WARNING: XYZZY: Starting
Filesystem /dev/drbd1
Now, if I bring 'master' back online (and I've stickied that resource
to master, so it moves back straight away):
# echo > /var/log/messages
# crm node online master
# crm status
============
Last updated: Mon Jun 20 15:11:29 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============
Online: [ master slave ]
 Master/Slave Set: ms_drbd_asterisk
     Masters: [ master ]
     Slaves: [ slave ]
 Resource Group: asterisk
     fs_asterisk    (ocf::heartbeat:Filesystem):    Started master
     ip_asterisk    (ocf::heartbeat:IPaddr2):       Started master

Failed actions:
    fs_asterisk_start_0 (node=slave, call=809, rc=1, status=complete): unknown error
And grepping produces:
# grep XYZZY /var/log/messages
Jun 20 15:10:51 localhost Filesystem[13889]: WARNING: XYZZY: Stopping
Filesystem /dev/drbd1
Jun 20 15:10:56 localhost Filesystem[15338]: WARNING: XYZZY: Starting
Filesystem /dev/drbd1
Jun 20 15:10:58 localhost Filesystem[15593]: WARNING: XYZZY: Stopping
Filesystem /dev/drbd1
Oddly enough, it's actually _the same 3_ of the 5 groups doing this
each time.
Complete 'crm configure show' output here: http://pastebin.com/0S7UVZ5U
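Regarding the 'stickied to master' bit above: that's just a location
preference for the asterisk group, roughly along these lines
(illustrative only - the real IDs and scores are in the pastebin):

location loc_asterisk_on_master asterisk 100: master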
(Note that I've also tried it with a longer monitor interval - 59s -
and without the --target-role=started on a couple of the resources.)
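To be clear, by that I mean tweaking the op/meta lines on the
primitive, along these lines (the device is the one from the logs
above, but the directory and fstype here are placeholders rather than
my real values):

primitive fs_asterisk ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/some/mountpoint" fstype="ext4" \
        op monitor interval="59s" \
        meta target-role="Started"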
Doing a 'crm status' whilst it's failing over, you see it say 'FAILED:
slave' before it starts up on the master.
If there's any useful information I can provide from /var/log/messages
I'd be happy to send it along, but I just need to know which parts you
consider important, as there's tonnes of it!
--Rob