[Pacemaker] advisory ordering question

Andrew Beekhof andrew at beekhof.net
Fri May 28 02:00:18 EDT 2010


On Thu, May 27, 2010 at 6:41 PM, Gianluca Cecchi
<gianluca.cecchi at gmail.com> wrote:
> On Tue, May 25, 2010 at 3:39 PM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
>>
>> Hi,
>> [snip]
>> > So I presume the problem could be caused by the fact that the second
>> > part is
>> > a clone and not a resource? or a bug?
>> > I can eventually send the whole config.
>>
>> Looks like a bug to me. Clone or resource, constraints should be
>> observed. Perhaps it's a duplicate of this one:
>> http://developerbugs.linux-foundation.org/show_bug.cgi?id=2422
>>
>
> I don't know whether it is the same bug.
> In your bug report the problem is that:
> - the stop operation does not enforce the order constraint
> - the constraint is between a clone of a group and a clone of a resource
> In my case the problem is that:
> - the start operation does not enforce the order constraint
> - the constraint is between a group and a clone of a group
> I verified the same behavior also when:
> - the constraint is between a resource and a clone of a group
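> For reference, the kind of constraint I mean is an advisory order along
> these lines (just a sketch: the constraint id is made up, the resource
> names are the ones from my status output below):
>
> # score 0 makes the order advisory instead of mandatory
> order nfs_before_apache 0: nfs-group apache_clone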

Can you submit a bug report and include an hb_report for this scenario, please?
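For example (the start time and destination here are just placeholders;
adjust the time to shortly before your test):

  hb_report -f "2010-05-27 17:20" /tmp/ordering-report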

> In both cases the clone of a group is the second part of the
> constraint; I have not tested the reverse order. So a common root cause
> for these problems, if one exists, could be related to the clone part...
> In the meantime I put this workaround in place:
> I set my nfsclient resource with the mount options "bg,intr",
> the important part here being bg.
> I think the intr option could also be useful in general, in case the
> cluster needs to perform operations on the mount (there is no soft
> option, so in my case hard should be the default, avoiding possible
> data corruption, correct?)
> So now it is:
> primitive nfsclient ocf:heartbeat:Filesystem \
>         params device="nfssrv:/nfsdata/web" directory="/usr/local/data" fstype="nfs" options="intr,bg" \
>         op monitor interval="30" timeout="40" depth="0" \
>         op start interval="0" timeout="60" \
>         op stop interval="0" timeout="60"
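> Outside the cluster, the equivalent manual mount for testing these
> options would be something like this (hard is the implicit default,
> spelled out here only for clarity):
>
> mount -t nfs -o bg,intr,hard nfssrv:/nfsdata/web /usr/local/data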
> This way the start of nfsclient succeeds (in terms of return code ;-),
> the first monitor of nfsclient then fails, and the subsequent
> stop/start of nfsclient succeeds, so in the end I can start all the
> resources...
> See below the related log lines that show the behavior.
> In the end my cluster status, after I start corosync on one node (the
> other node powered on but with corosync stopped), is now correctly:
> [root@node1 ~]# crm_mon -fr1
> ============
> Last updated: Thu May 27 17:56:52 2010
> Stack: openais
> Current DC: node1. - partition WITHOUT quorum
> Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
> Online: [ node1. ]
> OFFLINE: [ node2. ]
> Full list of resources:
>  Master/Slave Set: NfsData
>      Masters: [ node1. ]
>      Stopped: [ nfsdrbd:1 ]
>  Resource Group: nfs-group
>      lv_nfsdata_drbd    (ocf::heartbeat:LVM):   Started node1.
>      NfsFS      (ocf::heartbeat:Filesystem):    Started node1.
>      VIPlbtest  (ocf::heartbeat:IPaddr2):       Started node1.
>      nfssrv     (ocf::heartbeat:nfsserver):     Started node1.
>  Clone Set: cl-pinggw
>      Started: [ node1. ]
>      Stopped: [ pinggw:1 ]
>  Clone Set: apache_clone
>      Started: [ node1. ]
>      Stopped: [ apache_group:1 ]
> Migration summary:
> * Node node1.:  pingd=200
>    nfsclient:0: migration-threshold=1000000 fail-count=1
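> (Side note: once everything is up again I clear that failcount with
> something like "crm resource cleanup apache_clone", so the migration
> summary goes back to clean.)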
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Promote nfsdrbd:0 (Slave -> Master node1.)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfsdrbd:1 (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource lv_nfsdata_drbd (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource NfsFS (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource VIPlbtest (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfssrv (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource pinggw:0 (Started node1.)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource pinggw:1 (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Start nfsclient:0 (node1.)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Start apachepoli:0 (node1.)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource nfsclient:1 (Stopped)
> May 27 17:26:30 node1 pengine: [32373]: notice: LogActions: Leave resource apachepoli:1 (Stopped)
> ...
> May 27 17:26:30 node1 lrmd: [32371]: info: rsc:nfsdrbd:0:14: notify
> ...
> May 27 17:26:30 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=43:2:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_start_0 )
> May 27 17:26:30 node1 lrmd: [32371]: info: rsc:nfsclient:0:15: start
> May 27 17:26:30 node1 Filesystem[32729]: INFO: Running start for viplbtest.:/nfsdata/web on /usr/local/data
> May 27 17:26:30 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsdrbd:0_notify_0 (call=14, rc=0, cib-update=41, confirmed=true) ok
> May 27 17:26:30 node1 crmd: [32374]: info: match_graph_event: Action nfsdrbd:0_pre_notify_promote_0 (85) confirmed on node1. (rc=0)
> May 27 17:26:30 node1 crmd: [32374]: info: te_pseudo_action: Pseudo action 23 fired and confirmed
> ...
> May 27 17:26:30 node1 kernel: block drbd0: role( Secondary -> Primary )
> May 27 17:26:30 node1 lrmd: [32371]: info: RA output: (nfsdrbd:0:promote:stdout)
> May 27 17:26:30 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsdrbd:0_promote_0 (call=16, rc=0, cib-update=42, confirmed=true) ok
> May 27 17:26:30 node1 crmd: [32374]: info: match_graph_event: Action nfsdrbd:0_promote_0 (7) confirmed on node1. (rc=0)
> ...
> May 27 17:26:43 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_start_0 (call=15, rc=0, cib-update=45, confirmed=true) ok
> May 27 17:26:43 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_start_0 (43) confirmed on node1. (rc=0)
> ...
> May 27 17:26:43 node1 pengine: [32373]: info: determine_online_status: Node node1. is online
> May 27 17:26:43 node1 pengine: [32373]: notice: clone_print:  Master/Slave Set: NfsData
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Masters: [ node1. ]
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Stopped: [ nfsdrbd:1 ]
> May 27 17:26:43 node1 pengine: [32373]: notice: group_print:  Resource Group: nfs-group
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      lv_nfsdata_drbd  (ocf::heartbeat:LVM):   Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      NfsFS  (ocf::heartbeat:Filesystem):    Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      VIPlbtest        (ocf::heartbeat:IPaddr2):       Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      nfssrv (ocf::heartbeat:nfsserver):     Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: clone_print:  Clone Set: cl-pinggw
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Started: [ node1. ]
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Stopped: [ pinggw:1 ]
> May 27 17:26:43 node1 pengine: [32373]: notice: clone_print:  Clone Set: apache_clone
> May 27 17:26:43 node1 pengine: [32373]: notice: group_print:      Resource Group: apache_group:0
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      nfsclient:0  (ocf::heartbeat:Filesystem):    Started node1.
> May 27 17:26:43 node1 pengine: [32373]: notice: native_print:      apachepoli:0 (ocf::poli:apache2):    Stopped
> May 27 17:26:43 node1 pengine: [32373]: notice: short_print:      Stopped: [ apache_group:1 ]
> ...
> May 27 17:26:43 node1 pengine: [32373]: info: master_color: Promoting nfsdrbd:0 (Master node1.)
> May 27 17:26:43 node1 pengine: [32373]: info: master_color: NfsData: Promoted 1 instances of a possible 1 to master
> ...
> May 27 17:26:43 node1 lrmd: [32371]: info: rsc:lv_nfsdata_drbd:18: start
> May 27 17:26:43 node1 crmd: [32374]: info: te_rsc_command: Initiating action 51: monitor nfsclient:0_monitor_30000 on node1. (local)
> May 27 17:26:43 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=51:3:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_monitor_30000 )
> May 27 17:26:43 node1 lrmd: [32371]: info: rsc:nfsclient:0:19: monitor
> May 27 17:26:43 node1 LVM[411]: INFO: Activating volume group vg_nfsdata_drbd
> May 27 17:26:43 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_monitor_30000 (call=19, rc=7, cib-update=47, confirmed=false) not running
> May 27 17:26:43 node1 crmd: [32374]: WARN: status_from_rc: Action 51 (nfsclient:0_monitor_30000) on node1. failed (target: 0 vs. rc: 7): Error
> May 27 17:26:43 node1 crmd: [32374]: WARN: update_failcount: Updating failcount for nfsclient:0 on node1. after failed monitor: rc=7 (update=value++, time=1274974003)
> ...
> May 27 17:26:43 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_monitor_30000 (51) confirmed on node1. (rc=4)
> May 27 17:26:43 node1 attrd: [32372]: info: find_hash_entry: Creating hash entry for fail-count-nfsclient:0
> May 27 17:26:43 node1 attrd: [32372]: info: attrd_local_callback: Expanded fail-count-nfsclient:0=value++ to 1
> May 27 17:26:43 node1 attrd: [32372]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nfsclient:0 (1)
> May 27 17:26:43 node1 attrd: [32372]: info: attrd_perform_update: Sent update 17: fail-count-nfsclient:0=1
> ...
> May 27 17:26:43 node1 LVM[411]: INFO: Reading all physical volumes. This may take a while... Found volume group "VolGroup00" using metadata type lvm2 Found volume group "vg_nfsdata_drbd" using metadata type lvm2
> May 27 17:26:44 node1 LVM[411]: INFO: 1 logical volume(s) in volume group "vg_nfsdata_drbd" now active
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation lv_nfsdata_drbd_start_0 (call=18, rc=0, cib-update=48, confirmed=true) ok
> May 27 17:26:44 node1 crmd: [32374]: info: match_graph_event: Action lv_nfsdata_drbd_start_0 (33) confirmed on node1. (rc=0)
> ...
> May 27 17:26:44 node1 pengine: [32373]: info: determine_online_status: Node node1. is online
> May 27 17:26:44 node1 pengine: [32373]: WARN: unpack_rsc_op: Processing failed op nfsclient:0_monitor_30000 on node1.: not running (7)
> May 27 17:26:44 node1 pengine: [32373]: notice: clone_print:  Master/Slave Set: NfsData
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Masters: [ node1. ]
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Stopped: [ nfsdrbd:1 ]
> May 27 17:26:44 node1 pengine: [32373]: notice: group_print:  Resource Group: nfs-group
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      lv_nfsdata_drbd  (ocf::heartbeat:LVM):   Started node1.
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      NfsFS  (ocf::heartbeat:Filesystem):    Stopped
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      VIPlbtest        (ocf::heartbeat:IPaddr2):       Stopped
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      nfssrv (ocf::heartbeat:nfsserver):     Stopped
> May 27 17:26:44 node1 pengine: [32373]: notice: clone_print:  Clone Set: cl-pinggw
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Started: [ node1. ]
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Stopped: [ pinggw:1 ]
> May 27 17:26:44 node1 pengine: [32373]: notice: clone_print:  Clone Set: apache_clone
> May 27 17:26:44 node1 pengine: [32373]: notice: group_print:      Resource Group: apache_group:0
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      nfsclient:0  (ocf::heartbeat:Filesystem):    Started node1. FAILED
> May 27 17:26:44 node1 pengine: [32373]: notice: native_print:      apachepoli:0 (ocf::poli:apache2):    Stopped
> May 27 17:26:44 node1 pengine: [32373]: notice: short_print:      Stopped: [ apache_group:1 ]
> May 27 17:26:44 node1 pengine: [32373]: info: get_failcount: apache_clone has failed 1 times on node1.
> May 27 17:26:44 node1 pengine: [32373]: notice: common_apply_stickiness: apache_clone can fail 999999 more times on node1. before being forced off
> ...
> May 27 17:26:44 node1 pengine: [32373]: notice: LogActions: Recover resource nfsclient:0 (Started node1.)
> May 27 17:26:44 node1 pengine: [32373]: notice: LogActions: Start apachepoli:0 (node1.)
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 38: start NfsFS_start_0 on node1. (local)
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=38:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=NfsFS_start_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:NfsFS:21: start
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 3: stop nfsclient:0_stop_0 on node1. (local)
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=3:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_stop_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfsclient:0:22: stop
> ...
> May 27 17:26:44 node1 Filesystem[493]: INFO: Running start for /dev/vg_nfsdata_drbd/lv_nfsdata_drbd on /nfsdata
> May 27 17:26:44 node1 Filesystem[502]: INFO: Running stop for viplbtest.:/nfsdata/web on /usr/local/data
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_stop_0 (call=22, rc=0, cib-update=50, confirmed=true) ok
> May 27 17:26:44 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_stop_0 (3) confirmed on node1. (rc=0)
> ...
> May 27 17:26:44 node1 kernel: kjournald starting.  Commit interval 5 seconds
> May 27 17:26:44 node1 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
> May 27 17:26:44 node1 kernel: EXT3 FS on dm-0, internal journal
> May 27 17:26:44 node1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation NfsFS_start_0 (call=21, rc=0, cib-update=51, confirmed=true) ok
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 39: start VIPlbtest_start_0 on node1. (local)
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=39:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=VIPlbtest_start_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:VIPlbtest:23: start
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation VIPlbtest_start_0 (call=23, rc=0, cib-update=53, confirmed=true) ok
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 41: start nfssrv_start_0 on node1. (local)
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=41:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfssrv_start_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfssrv:25: start
> ...
> May 27 17:26:44 node1 nfsserver[661]: INFO: Starting NFS server ...
> May 27 17:26:44 node1 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> May 27 17:26:44 node1 kernel: NFSD: starting 90-second grace period
> May 27 17:26:44 node1 lrmd: [32371]: info: RA output: (nfssrv:start:stdout) rpc.statd version 1.0.9
> May 27 17:26:44 node1 nfsserver[661]: INFO: NFS server started
> May 27 17:26:44 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfssrv_start_0 (call=25, rc=0, cib-update=55, confirmed=true) ok
> ...
> May 27 17:26:44 node1 crmd: [32374]: info: te_rsc_command: Initiating action 52: start nfsclient:0_start_0 on node1. (local)
> May 27 17:26:44 node1 crmd: [32374]: info: do_lrm_rsc_op: Performing key=52:4:0:6b4ec352-3854-4caf-8882-8f060ba68420 op=nfsclient:0_start_0 )
> May 27 17:26:44 node1 lrmd: [32371]: info: rsc:nfsclient:0:26: start
> May 27 17:26:45 node1 Filesystem[771]: INFO: Running start for viplbtest.:/nfsdata/web on /usr/local/data
> May 27 17:26:46 node1 mountd[746]: authenticated mount request from 10.4.4.140:921 for /nfsdata/web (/nfsdata/web)
> May 27 17:26:46 node1 mountd[746]: authenticated mount request from 10.4.4.140:999 for /nfsdata/web (/nfsdata/web)
> May 27 17:26:46 node1 crmd: [32374]: info: process_lrm_event: LRM operation nfsclient:0_start_0 (call=26, rc=0, cib-update=56, confirmed=true) ok
> May 27 17:26:46 node1 crmd: [32374]: info: match_graph_event: Action nfsclient:0_start_0 (52) confirmed on node1. (rc=0)
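>
> If it helps the bug analysis, I can also replay the failed transition
> offline against the saved PE input, along these lines (NN is a
> placeholder for the transition number; I am not attaching those files
> here):
>
> ptest -x /var/lib/pengine/pe-input-NN.bz2 -VV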