[Pacemaker] advisory ordering question

Gianluca Cecchi gianluca.cecchi at gmail.com
Thu May 27 12:41:16 EDT 2010


On Tue, May 25, 2010 at 3:39 PM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:

> Hi,
> [snip]
> > So I presume the problem could be caused by the fact that the second part
> > is a clone and not a resource? or a bug?
> > I can eventually send the whole config.
>
> Looks like a bug to me. Clone or resource, constraints should be
> observed. Perhaps it's a duplicate of this one:
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=2422
>
>
I don't know if it is the same bug...
In your bug report the problem is related to:
- the stop operation not enforcing the order constraint
- a constraint between a clone of a group and a clone of a resource

In my case the problem is:
- the start operation not enforcing the order constraint
- a constraint between a group and a clone of a group
I also verified it when
- the constraint is between a resource and a clone of a group

In both cases the clone of a group is the second part of the constraint; I have not tested the reverse order.

So a common root cause for the problems, if one exists, is probably
related to the clone part...
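
For reference, a minimal sketch of how such constraints can be written in the
crm shell (the ids and scores here are illustrative only, not my actual
constraint definitions; resource names as in the status output further below):

# mandatory ordering: apache_clone is started only after nfs-group is up
order o-nfs-before-apache inf: nfs-group apache_clone
# advisory ordering: same relation, but only honoured when both resources
# happen to be started/stopped in the same transition
order o-nfs-before-apache-adv 0: nfs-group apache_clone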
In the meantime I applied this workaround in my setup:
I configured my nfsclient resource with the mount options "bg,intr",
the important part here being "bg".
I think the "intr" option could also be useful in general, in case the
cluster needs to perform some operations on a hung mount (I did not add
"soft", so in my case "hard" should be the default, avoiding possible data
corruption, correct?)

So now it is:
primitive nfsclient ocf:heartbeat:Filesystem \
        params device="nfssrv:/nfsdata/web" directory="/usr/local/data" \
              fstype="nfs" options="intr,bg" \
        op monitor interval="30" timeout="40" depth="0" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
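
With these parameters the Filesystem agent basically performs an NFS mount
like this (illustrative only, the agent builds the real command itself):

mount -t nfs -o intr,bg nfssrv:/nfsdata/web /usr/local/data

and with "bg" a mount that cannot reach the server returns immediately and
keeps retrying in the background instead of failing.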

This way the start of nfsclient succeeds (in terms of return code ;-),
then the first monitor for nfsclient fails, and the consequent stop/start
of nfsclient succeeds, so that I can finally start the resources...
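
Side note: once everything is up, the leftover fail-count from that first
failed monitor can be cleared by hand if desired, for example with the crm
shell ("apache_clone" being the cloned group that contains nfsclient):

crm resource cleanup apache_clone

It is not required for the workaround itself.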

See below the related log lines that show the behavior.
At the end, after I start corosync on one node (the other node being powered
on but with corosync stopped), my cluster status is now correctly:

[root@node1 ~]# crm_mon -fr1
============
Last updated: Thu May 27 17:56:52 2010
Stack: openais
Current DC: node1. - partition WITHOUT quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ node1. ]
OFFLINE: [ node2. ]

Full list of resources:

 Master/Slave Set: NfsData
     Masters: [ node1. ]
     Stopped: [ nfsdrbd:1 ]
 Resource Group: nfs-group
     lv_nfsdata_drbd    (ocf::heartbeat:LVM):   Started node1.
     NfsFS      (ocf::heartbeat:Filesystem):    Started node1.
     VIPlbtest  (ocf::heartbeat:IPaddr2):       Started node1.
     nfssrv     (ocf::heartbeat:nfsserver):     Started node1.
 Clone Set: cl-pinggw
     Started: [ node1. ]
     Stopped: [ pinggw:1 ]
 Clone Set: apache_clone
     Started: [ node1. ]
     Stopped: [ apache_group:1 ]

Migration summary:
* Node node1.:  pingd=200
   nfsclient:0: migration-threshold=1000000 fail-count=1

May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Promote
nfsdrbd:0       (Slave -> Master node1.)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
nfsdrbd:1        (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
lv_nfsdata_drbd  (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
NfsFS    (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
VIPlbtest        (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
nfssrv   (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
pinggw:0 (Started node1.)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
pinggw:1 (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Start
nfsclient:0       (node1.)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Start
apachepoli:0      (node1.)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
nfsclient:1      (Stopped)
May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource
apachepoli:1     (Stopped)
...
May 27 17:26:30 node1 lrmd: [32371]: info: rsc:nfsdrbd:0:14: notify
...
May 27 17:26:30 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing
key=43:2:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_start_0 )
May 27 17:26:30 node1 lrmd: [32371]: info: rsc:nfsclient:0:15: start
May 27 17:26:30 node1 Filesystem[32729]: INFO: Running start for
viplbtest.:/nfsdata/web on /usr/local/data
May 27 17:26:30 node1 crmd: [32374]: info: process_lrm_event: LRM operation
nfsdrbd:0_notify_0 (call=14, rc=0, cib-update=41, confirmed=true) ok
May 27 17:26:30 node1 crmd: [32374]: info: match_graph_event: Action
nfsdrbd:0_pre_notify_promote_0 (85) confirmed on node1. (rc=0)
May 27 17:26:30 node1 crmd: [32374]: info: te_pseudo_action: Pseudo action
23 fired and confirmed
...
May 27 17:26:30 node1 kernel: block drbd0: role( Secondary -> Primary )
May 27 17:26:30 node1 lrmd: [32371]: info: RA output:
(nfsdrbd:0:promote:stdout)
May 27 17:26:30 node1 crmd: [32374]: info: process_lrm_event: LRM operation
nfsdrbd:0_promote_0 (call=16, rc=0, cib-update=42, confirmed=true) ok
May 27 17:26:30 node1 crmd: [32374]: info: match_graph_event: Action
nfsdrbd:0_promote_0 (7) confirmed on node1. (rc=0)
...
May 27 17:26:43 node1 crmd: [32374]: info: process_lrm_event: LRM operation
nfsclient:0_start_0 (call=15, rc=0, cib-update=45, confirmed=true) ok
May 27 17:26:43 node1 crmd: [32374]: info: match_graph_event: Action
nfsclient:0_start_0 (43) confirmed on node1. (rc=0)

...
May 27 17:26:43 node1 pengine: [32373]: info: determine_online_status: Node
node1. is online
May 27 17:26:43 node1 pengine: [32373]: notice: clone_print:  Master/Slave
Set: NfsData
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Masters: [
node1. ]
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Stopped: [
nfsdrbd:1 ]
May 27 17:26:43 node1 pengine: [32373]: notice: group_print:  Resource
Group: nfs-group
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:
 lv_nfsdata_drbd  (ocf::heartbeat:LVM):   Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      NfsFS
 (ocf::heartbeat:Filesystem):    Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      VIPlbtest
       (ocf::heartbeat:IPaddr2):       Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      nfssrv
(ocf::heartbeat:nfsserver):     Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: clone_print:  Clone Set:
cl-pinggw
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Started: [
node1. ]
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Stopped: [
pinggw:1 ]
May 27 17:26:43 node1 pengine: [32373]: notice: clone_print:  Clone Set:
apache_clone
May 27 17:26:43 node1 pengine: [32373]: notice: group_print:      Resource
Group: apache_group:0
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:
 nfsclient:0  (ocf::heartbeat:Filesystem):    Started node1.
May 27 17:26:43 node1 pengine: [32373]: notice: native_print:
 apachepoli:0 (ocf::poli:apache2):    Stopped
May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Stopped: [
apache_group:1 ]
...
May 27 17:26:43 node1 pengine: [32373]: info: master_color: Promoting
nfsdrbd:0 (Master node1.)
May 27 17:26:43 node1 pengine: [32373]: info: master_color: NfsData:
Promoted 1 instances of a possible 1 to master
...
May 27 17:26:43 node1 lrmd: [32371]: info: rsc:lv_nfsdata_drbd:18: start
May 27 17:26:43 node1 crmd: [32374]: info: te_rsc_command: Initiating action
51: monitor nfsclient:0_monitor_30000 on node1. (local)
May 27 17:26:43 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing
key=51:3:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_monitor_30000
)
May 27 17:26:43 node1 lrmd: [32371]: info: rsc:nfsclient:0:19: monitor
May 27 17:26:43 node1 LVM[411]: INFO: Activating volume group
vg_nfsdata_drbd
May 27 17:26:43 node1 crmd: [32374]: info: process_lrm_event: LRM operation
nfsclient:0_monitor_30000 (call=19, rc=7, cib-update=47, confirmed=false)
not running
May 27 17:26:43 node1 crmd: [32374]: WARN: status_from_rc: Action 51
(nfsclient:0_monitor_30000) on node1. failed (target: 0 vs. rc: 7): Error
May 27 17:26:43 node1 crmd: [32374]: WARN: update_failcount: Updating
failcount for nfsclient:0 on node1. after failed monitor: rc=7
(update=value++, time=1274974003)
...
May 27 17:26:43 node1 crmd: [32374]: info: match_graph_event: Action
nfsclient:0_monitor_30000 (51) confirmed on node1. (rc=4)
May 27 17:26:43 node1 attrd: [32372]: info: find_hash_entry: Creating hash
entry for fail-count-nfsclient:0
May 27 17:26:43 node1 attrd: [32372]: info: attrd_local_callback: Expanded
fail-count-nfsclient:0=value++ to 1
May 27 17:26:43 node1 attrd: [32372]: info: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-nfsclient:0 (1)
May 27 17:26:43 node1 attrd: [32372]: info: attrd_perform_update: Sent
update 17: fail-count-nfsclient:0=1
...
May 27 17:26:43 node1 LVM[411]: INFO: Reading all physical volumes. This may
take a while... Found volume group "VolGroup00" using metadata type lvm2
Found volume group "vg_nfsdata_drbd" using metadata type lvm2
May 27 17:26:44 node1 LVM[411]: INFO: 1 logical volume(s) in volume group
"vg_nfsdata_drbd" now active
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation
lv_nfsdata_drbd_start_0 (call=18, rc=0, cib-update=48, confirmed=true) ok
May 27 17:26:44 node1 crmd: [32374]: info: match_graph_event: Action
lv_nfsdata_drbd_start_0 (33) confirmed on node1. (rc=0)
...
May 27 17:26:44 node1 pengine: [32373]: info: determine_online_status: Node
node1. is online
May 27 17:26:44 node1 pengine: [32373]: WARN: unpack_rsc_op: Processing
failed op nfsclient:0_monitor_30000 on node1.: not running (7)
May 27 17:26:44 node1 pengine: [32373]: notice: clone_print:  Master/Slave
Set: NfsData
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Masters: [
node1. ]
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Stopped: [
nfsdrbd:1 ]
May 27 17:26:44 node1 pengine: [32373]: notice: group_print:  Resource
Group: nfs-group
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:
 lv_nfsdata_drbd  (ocf::heartbeat:LVM):   Started node1.
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      NfsFS
 (ocf::heartbeat:Filesystem):    Stopped
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      VIPlbtest
       (ocf::heartbeat:IPaddr2):       Stopped
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      nfssrv
(ocf::heartbeat:nfsserver):     Stopped
May 27 17:26:44 node1 pengine: [32373]: notice: clone_print:  Clone Set:
cl-pinggw
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Started: [
node1. ]
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Stopped: [
pinggw:1 ]
May 27 17:26:44 node1 pengine: [32373]: notice: clone_print:  Clone Set:
apache_clone
May 27 17:26:44 node1 pengine: [32373]: notice: group_print:      Resource
Group: apache_group:0
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:
 nfsclient:0  (ocf::heartbeat:Filesystem):    Started node1. FAILED
May 27 17:26:44 node1 pengine: [32373]: notice: native_print:
 apachepoli:0 (ocf::poli:apache2):    Stopped
May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Stopped: [
apache_group:1 ]
May 27 17:26:44 node1 pengine: [32373]: info: get_failcount: apache_clone
has failed 1 times on node1.
May 27 17:26:44 node1 pengine: [32373]: notice: common_apply_stickiness:
apache_clone can fail 999999 more times on node1. before being forced off
...
May 27 17:26:44 node1 pengine: [32373]: notice: LogActions: Recover resource
nfsclient:0    (Started node1.)
May 27 17:26:44 node1 pengine: [32373]: notice: LogActions: Start
apachepoli:0      (node1.)
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action
38: start NfsFS_start_0 on node1. (local)
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing
key=38:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=NfsFS_start_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:NfsFS:21: start
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action
3: stop nfsclient:0_stop_0 on node1. (local)
...
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing
key=3:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_stop_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfsclient:0:22: stop
...
May 27 17:26:44 node1 Filesystem[493]: INFO: Running start for
/dev/vg_nfsdata_drbd/lv_nfsdata_drbd on /nfsdata
May 27 17:26:44 node1 Filesystem[502]: INFO: Running stop for
viplbtest.:/nfsdata/web on /usr/local/data
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation
nfsclient:0_stop_0 (call=22, rc=0, cib-update=50, confirmed=true) ok
May 27 17:26:44 node1 crmd: [32374]: info: match_graph_event: Action
nfsclient:0_stop_0 (3) confirmed on node1. (rc=0)
...
May 27 17:26:44 node1 kernel: kjournald starting.  Commit interval 5 seconds
May 27 17:26:44 node1 kernel: EXT3-fs warning: maximal mount count reached,
running e2fsck is recommended
May 27 17:26:44 node1 kernel: EXT3 FS on dm-0, internal journal
May 27 17:26:44 node1 kernel: EXT3-fs: mounted filesystem with ordered data
mode.
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation
NfsFS_start_0 (call=21, rc=0, cib-update=51, confirmed=true) ok
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action
39: start VIPlbtest_start_0 on node1. (local)
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing
key=39:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=VIPlbtest_start_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:VIPlbtest:23: start
...
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation
VIPlbtest_start_0 (call=23, rc=0, cib-update=53, confirmed=true) ok
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action
41: start nfssrv_start_0 on node1. (local)
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing
key=41:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfssrv_start_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfssrv:25: start
...
May 27 17:26:44 node1 nfsserver[661]: INFO: Starting NFS server ...
May 27 17:26:44 node1 kernel: NFSD: Using /var/lib/nfs/v4recovery as the
NFSv4 state recovery directory
May 27 17:26:44 node1 kernel: NFSD: starting 90-second grace period
May 27 17:26:44 node1 lrmd: [32371]: info: RA output: (nfssrv:start:stdout)
rpc.statd version 1.0.9
May 27 17:26:44 node1 nfsserver[661]: INFO: NFS server started
May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation
nfssrv_start_0 (call=25, rc=0, cib-update=55, confirmed=true) ok
...
May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action
52: start nfsclient:0_start_0 on node1. (local)
May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing
key=52:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_start_0 )
May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfsclient:0:26: start
May 27 17:26:45 node1 Filesystem[771]: INFO: Running start for
viplbtest.:/nfsdata/web on /usr/local/data
May 27 17:26:46 node1 mountd[746]: authenticated mount request from
10.4.4.140:921 for /nfsdata/web (/nfsdata/web)
May 27 17:26:46 node1 mountd[746]: authenticated mount request from
10.4.4.140:999 for /nfsdata/web (/nfsdata/web)
May 27 17:26:46 node1 crmd: [32374]: info: process_lrm_event: LRM operation
nfsclient:0_start_0 (call=26, rc=0, cib-update=56, confirmed=true) ok
May 27 17:26:46 node1 crmd: [32374]: info: match_graph_event: Action
nfsclient:0_start_0 (52) confirmed on node1. (rc=0)