[Pacemaker] NFS resource isn't completely working

Andrew Beekhof andrew at beekhof.net
Wed Oct 24 20:59:34 EDT 2012


On Wed, Oct 17, 2012 at 8:30 AM, Lonni J Friedman <netllama at gmail.com> wrote:
> Greetings,
> I'm trying to get an NFS server export to be correctly monitored &
> managed by pacemaker, along with pre-existing IP, drbd and filesystem
> mounts (which are working correctly).  While NFS is up on the primary
> node (along with the other services), the monitoring portion keeps
> showing up as a failed action, reported as 'not running'.
>
> Here's my current configuration:
> ################
> node farm-ljf0 \
>         attributes standby="off"
> node farm-ljf1
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>         params ip="10.31.97.100" cidr_netmask="22" nic="eth1" \
>         op monitor interval="10s" \
>         meta target-role="Started"
> primitive FS0 ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="10s" role="Master" \
>         op monitor interval="30s" role="Slave"
> primitive FS0_drbd ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/mnt/sdb1" fstype="xfs" \
>         meta target-role="Started"
> primitive FS0_nfs systemd:nfs-server \
>         op monitor interval="10s" \
>         meta target-role="Started"
> group g_services ClusterIP FS0_drbd FS0_nfs
> ms FS0_Clone FS0 \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> colocation fs0_on_drbd inf: g_services FS0_Clone:Master
> order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.8-2.fc16-394e906" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore"
> ################
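It's worth letting pacemaker sanity-check the loaded configuration before chasing the monitor failures; something along these lines (run on either node) should flag constraint or attribute problems in the live CIB:

    # validate the live CIB and print any errors or warnings
    crm_verify --live-check -V

    # the same check from the crm shell
    crm configure verify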
>
> Here's the output from 'crm status'
> ################
> Last updated: Tue Oct 16 14:26:22 2012
> Last change: Tue Oct 16 14:23:18 2012 via cibadmin on farm-ljf1
> Stack: openais
> Current DC: farm-ljf1 - partition with quorum
> Version: 1.1.8-2.fc16-394e906
> 2 Nodes configured, 2 expected votes
> 5 Resources configured.
>
>
> Online: [ farm-ljf0 farm-ljf1 ]
>
>  Master/Slave Set: FS0_Clone [FS0]
>      Masters: [ farm-ljf1 ]
>      Slaves: [ farm-ljf0 ]
>  Resource Group: g_services
>      ClusterIP  (ocf::heartbeat:IPaddr2):       Started farm-ljf1
>      FS0_drbd   (ocf::heartbeat:Filesystem):    Started farm-ljf1
>      FS0_nfs    (systemd:nfs-server):   Started farm-ljf1
>
> Failed actions:
>     FS0_nfs_monitor_10000 (node=farm-ljf1, call=54357, rc=7, status=complete): not running
>     FS0_nfs_monitor_10000 (node=farm-ljf0, call=131365, rc=7, status=complete): not running
> ################
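For reference, rc=7 in those failed actions is OCF_NOT_RUNNING: the monitor itself ran fine but found the service stopped. Once the underlying cause is sorted out, the stale failures (and the fail-counts behind them) can be cleared so the status output is readable again, for example:

    # forget the recorded failures for the NFS resource
    crm resource cleanup FS0_nfs

    # one-shot status including per-node fail-counts
    crm_mon -1 --failcounts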
>
> When I check the cluster log, I'm seeing a bunch of this stuff:

Your logs start too late, I'm afraid.
We need the earlier entries that show the job FS0_nfs_monitor_10000 failing.
Be sure to also check the system log file, since that will hopefully
have some information directly from systemd and/or nfs-server.
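Assuming a fairly stock Fedora 16 setup, something like this should pull the systemd/nfs-server side of the story out of the system log (adjust the path, or filter the journal instead, if your logging is set up differently):

    # systemd and nfs-server entries around the time the monitor started failing
    grep -iE 'nfs-server|systemd' /var/log/messages

    # if journalctl is available on your systemd version, the journal can be filtered the same way
    journalctl | grep -iE 'nfs-server|FS0_nfs'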

> #############
> Oct 16 14:23:17 [924] farm-ljf0      attrd:   notice: attrd_trigger_update:   Sending flush op to all hosts for: fail-count-FS0_nfs (11939)
> Oct 16 14:23:17 [924] farm-ljf0      attrd:   notice: attrd_trigger_update:   Sending flush op to all hosts for: probe_complete (true)
> Oct 16 14:23:17 [924] farm-ljf0      attrd:   notice: attrd_ais_dispatch:     Update relayed from farm-ljf1
> Oct 16 14:23:17 [924] farm-ljf0      attrd:   notice: attrd_trigger_update:   Sending flush op to all hosts for: fail-count-FS0_nfs (11940)
> Oct 16 14:23:17 [924] farm-ljf0      attrd:   notice: attrd_perform_update:   Sent update 25471: fail-count-FS0_nfs=11940
> Oct 16 14:23:17 [924] farm-ljf0      attrd:   notice: attrd_ais_dispatch:     Update relayed from farm-ljf1
> Oct 16 14:23:20 [923] farm-ljf0       lrmd:     info: cancel_recurring_action:        Cancelling operation FS0_nfs_status_10000
> Oct 16 14:23:20 [926] farm-ljf0       crmd:     info: process_lrm_event:      LRM operation FS0_nfs_monitor_10000 (call=131365, status=1, cib-update=0, confirmed=false) Cancelled
> Oct 16 14:23:20 [923] farm-ljf0       lrmd:     info: systemd_unit_exec_done:         Call to stop passed: type '(o)' /org/freedesktop/systemd1/job/1062961
> Oct 16 14:23:20 [926] farm-ljf0       crmd:   notice: process_lrm_event:      LRM operation FS0_nfs_stop_0 (call=131369, rc=0, cib-update=35842, confirmed=true) ok
> #############
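Those attrd lines are just the failure counter being replicated around the cluster: fail-count-FS0_nfs on farm-ljf0 is already up around 11940, so the 10-second monitor has been failing there more or less continuously, and the lrmd/crmd lines show the recurring monitor being cancelled and the resource being stopped. The systemd resource class essentially passes its monitor only while the unit reports itself active, so comparing what systemd thinks on each node against what pacemaker reports is a reasonable next step, e.g.:

    # roughly what the systemd class monitor is checking: the unit must be "active"
    systemctl is-active nfs-server.service

    # full unit state, including why it last stopped or failed
    systemctl status nfs-server.service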
>
> I'm not sure what any of that means.  I'd appreciate some guidance.
>
> thanks!
>



