[Pacemaker] advisory ordering question
Dejan Muhamedagic
dejanmm at fastmail.fm
Tue May 25 13:39:09 UTC 2010
Hi,
On Thu, May 20, 2010 at 06:09:01PM +0200, Gianluca Cecchi wrote:
> Hello,
> manual for 1.0 (and 1.1) reports this for Advisory Ordering:
>
> On the other hand, when score="0" is specified for a constraint, the
> constraint is considered optional and only has an effect when both resources
> are stopping and/or starting. Any change in state by the "first" resource
> will have no effect on the "then" resource.
>
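For illustration, a minimal sketch of the difference in crm syntax
(rsc_A and rsc_B are hypothetical resources):

# mandatory: any stop/start of rsc_A also forces a corresponding
# stop/start of rsc_B
order B_after_A inf: rsc_A rsc_B
# advisory: only orders the actions when both resources happen to be
# starting/stopping in the same transition anyway
order B_after_A_advisory 0: rsc_A rsc_B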
> (there is also a link to
> http://www.clusterlabs.org/mediawiki/images/d/d6/Ordering_Explained.pdf for
> a deeper look at ordering constraints, but it seems to be broken right now...)
>
> Does this also hold for an ordering constraint defined between a group and
> a clone, rather than between plain resources?
> Because I have this config
>
> order apache_after_nfsd 0: nfs-group apache_clone
>
> where
>
> group nfs-group lv_drbd0 ClusterIP NfsFS nfssrv \
> meta target-role="Started"
>
> group apache_group nfsclient apache \
> meta target-role="Started"
>
> clone apache_clone apache_group \
> meta target-role="Started"
>
> And when both nodes are up but corosync is stopped on both, and I then start
> corosync on one node, I see in the logs that:
> - inside nfs-group, lv_drbd0 (the linbit drbd resource) gets promoted, but
> the following components (nfssrv in particular) have not started yet
> - the nfsclient part of apache_clone tries to start, but fails because
> nfssrv is not in place yet
>
> I get the same problem if I change it to
> order apache_after_nfsd 0: nfssrv apache_clone
>
> So I presume the problem could be caused by the second part being a clone
> rather than a plain resource? Or is it a bug?
> I can send the whole config if needed.
Looks like a bug to me. Clone or resource, constraints should be
observed. Perhaps it's a duplicate of this one:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2422
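If you don't strictly need the ordering to be advisory, a mandatory
constraint should work around it in the meantime (an untested sketch,
just your existing constraint with the score changed):

order apache_after_nfsd inf: nfs-group apache_clone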
> Setting the interval parameter of op start for nfsclient to a value
> different from 0 doesn't make sense, correct?
Correct.
> What would it determine?
> A start every x seconds of the resource?
Yes. crmd wouldn't even allow it.
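A recurring interval belongs on the monitor operation only. A minimal
sketch, assuming nfsclient is the Filesystem resource doing the NFS
mount seen in your logs (the timeouts are made up, adjust to taste):

primitive nfsclient ocf:heartbeat:Filesystem \
        params device="viplbtest.:/nfsdata/web" \
                directory="/usr/local/data" fstype="nfs" \
        op start interval="0" timeout="60s" \
        op monitor interval="10s" timeout="40s"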
Thanks,
Dejan
> At the end of the process I have:
> [root at webtest1 ]# crm_mon -fr1
> ============
> Last updated: Thu May 20 17:58:38 2010
> Stack: openais
> Current DC: webtest1. - partition WITHOUT quorum
> Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ webtest1. ]
> OFFLINE: [ webtest2. ]
>
> Full list of resources:
>
> Master/Slave Set: NfsData
> Masters: [ webtest1. ]
> Stopped: [ nfsdrbd:1 ]
> Resource Group: nfs-group
> lv_nfsdata_drbd (ocf::heartbeat:LVM): Started webtest1.
> NfsFS (ocf::heartbeat:Filesystem): Started webtest1.
> VIPlbtest (ocf::heartbeat:IPaddr2): Started webtest1.
> nfssrv (ocf::heartbeat:nfsserver): Started webtest1.
> Clone Set: cl-pinggw
> Started: [ webtest1. ]
> Stopped: [ pinggw:1 ]
> Clone Set: apache_clone
> Stopped: [ apache_group:0 apache_group:1 ]
>
> Migration summary:
> * Node webtest1.: pingd=200
> nfsclient:0: migration-threshold=1000000 fail-count=1000000
>
> Failed actions:
> nfsclient:0_start_0 (node=webtest1., call=15, rc=1, status=complete):
> unknown error
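Note that with fail-count at 1000000 the cluster won't retry nfsclient:0
on that node. Once the ordering issue is sorted out, clear it with e.g.:

crm resource cleanup apache_clone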
>
>
> Example logs for the second case:
>
>
> May 20 17:33:55 webtest1 pengine: [14080]: info: determine_online_status:
> Node webtest1. is online
> May 20 17:33:55 webtest1 pengine: [14080]: notice: clone_print:
> Master/Slave Set: NfsData
> May 20 17:33:55 webtest1 pengine: [14080]: notice: short_print:
> Stopped: [ nfsdrbd:0 nfsdrbd:1 ]
> May 20 17:33:55 webtest1 pengine: [14080]: notice: group_print: Resource
> Group: nfs-group
> May 20 17:33:55 webtest1 pengine: [14080]: notice: native_print:
> lv_nfsdata_drbd (ocf::heartbeat:LVM): Stopped
> May 20 17:33:55 webtest1 pengine: [14080]: notice: native_print: NfsFS
> (ocf::heartbeat:Filesystem): Stopped
> May 20 17:33:55 webtest1 pengine: [14080]: notice: native_print:
> VIPlbtest (ocf::heartbeat:IPaddr2): Stopped
> May 20 17:33:55 webtest1 pengine: [14080]: notice: native_print: nfssrv
> (ocf::heartbeat:nfsserver): Stopped
> ...
> May 20 17:33:55 webtest1 pengine: [14080]: notice: clone_print: Clone Set:
> apache_clone
> May 20 17:33:55 webtest1 pengine: [14080]: notice: short_print:
> Stopped: [ apache_group:0 apache_group:1 ]
> ...
> May 20 17:33:55 webtest1 pengine: [14080]: notice: LogActions: Start
> nfsdrbd:0 (webtest1.)
> ...
> May 20 17:33:55 webtest1 pengine: [14080]: notice: LogActions: Start
> nfsclient:0 (webtest1.)
> May 20 17:33:55 webtest1 pengine: [14080]: notice: LogActions: Start
> apache:0 (webtest1.)
> ...
> May 20 17:33:57 webtest1 kernel: block drbd0: Starting worker thread (from
> cqueue/0 [68])
> May 20 17:33:57 webtest1 kernel: block drbd0: disk( Diskless -> Attaching )
> May 20 17:33:57 webtest1 kernel: block drbd0: Found 4 transactions (7 active
> extents) in activity log.
> May 20 17:33:57 webtest1 kernel: block drbd0: Method to ensure write
> ordering: barrier
> May 20 17:33:57 webtest1 kernel: block drbd0: max_segment_size ( = BIO size
> ) = 32768
> May 20 17:33:57 webtest1 kernel: block drbd0: drbd_bm_resize called with
> capacity == 8388280
> May 20 17:33:57 webtest1 kernel: block drbd0: resync bitmap: bits=1048535
> words=32768
> May 20 17:33:57 webtest1 kernel: block drbd0: size = 4096 MB (4194140 KB)
> May 20 17:33:57 webtest1 kernel: block drbd0: recounting of set bits took
> additional 0 jiffies
> May 20 17:33:57 webtest1 kernel: block drbd0: 144 KB (36 bits) marked
> out-of-sync by on disk bit-map.
> May 20 17:33:57 webtest1 kernel: block drbd0: disk( Attaching -> UpToDate )
> pdsk( DUnknown -> Outdated )
> May 20 17:33:57 webtest1 kernel: block drbd0: conn( StandAlone ->
> Unconnected )
> May 20 17:33:57 webtest1 kernel: block drbd0: Starting receiver thread (from
> drbd0_worker [14378])
> May 20 17:33:57 webtest1 kernel: block drbd0: receiver (re)started
> May 20 17:33:57 webtest1 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> May 20 17:33:57 webtest1 lrmd: [14078]: info: RA output:
> (nfsdrbd:0:start:stdout)
> May 20 17:33:57 webtest1 attrd: [14079]: info: attrd_trigger_update: Sending
> flush op to all hosts for: master-nfsdrbd:0 (10000)
> May 20 17:33:57 webtest1 attrd: [14079]: info: attrd_perform_update: Sent
> update 11: master-nfsdrbd:0=10000
> May 20 17:33:57 webtest1 crmd: [14081]: info: abort_transition_graph:
> te_update_diff:146 - Triggered transition abort (complete=0,
> tag=transient_attributes, id=webtest1., magic=NA, cib=0.407.11) : Transient
> attribute: update
> May 20 17:33:57 webtest1 lrmd: [14078]: info: RA output:
> (nfsdrbd:0:start:stdout)
> May 20 17:33:57 webtest1 crmd: [14081]: info: process_lrm_event: LRM
> operation nfsdrbd:0_start_0 (call=10, rc=0, cib-update=37, confirmed=true)
> ok
> May 20 17:33:57 webtest1 crmd: [14081]: info: match_graph_event: Action
> nfsdrbd:0_start_0 (12) confirmed on webtest1. (rc=0)
> May 20 17:33:57 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
> action 15 fired and confirmed
> May 20 17:33:57 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
> action 18 fired and confirmed
> May 20 17:33:57 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
> action 90: notify nfsdrbd:0_post_notify_start_0 on webtest1. (local)
> May 20 17:33:57 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
> key=90:1:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsdrbd:0_notify_0 )
> May 20 17:33:57 webtest1 lrmd: [14078]: info: rsc:nfsdrbd:0:12: notify
> May 20 17:33:57 webtest1 lrmd: [14078]: info: RA output:
> (nfsdrbd:0:notify:stdout)
> ...
> May 20 17:34:01 webtest1 pengine: [14080]: info: master_color: Promoting
> nfsdrbd:0 (Slave webtest1.)
> May 20 17:34:01 webtest1 pengine: [14080]: info: master_color: NfsData:
> Promoted 1 instances of a possible 1 to master
> ...
> May 20 17:34:01 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
> action 85: notify nfsdrbd:0_pre_notify_promote_0 on webtest1. (local)
> May 20 17:34:01 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
> key=85:2:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsdrbd:0_notify_0 )
> May 20 17:34:01 webtest1 lrmd: [14078]: info: rsc:nfsdrbd:0:14: notify
> May 20 17:34:01 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
> action 47 fired and confirmed
> May 20 17:34:01 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
> action 43: start nfsclient:0_start_0 on webtest1. (local)
> May 20 17:34:01 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
> key=43:2:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsclient:0_start_0 )
> May 20 17:34:01 webtest1 lrmd: [14078]: info: rsc:nfsclient:0:15: start
> May 20 17:34:01 webtest1 crmd: [14081]: info: process_lrm_event: LRM
> operation nfsdrbd:0_notify_0 (call=14, rc=0, cib-update=41, confirmed=true)
> ok
> May 20 17:34:01 webtest1 crmd: [14081]: info: match_graph_event: Action
> nfsdrbd:0_pre_notify_promote_0 (85) confirmed on webtest1. (rc=0)
> May 20 17:34:01 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
> action 23 fired and confirmed
> ...
> May 20 17:34:01 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
> action 20 fired and confirmed
> May 20 17:34:01 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
> action 7: promote nfsdrbd:0_promote_0 on webtest1. (local)
> May 20 17:34:01 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
> key=7:2:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsdrbd:0_promote_0 )
> May 20 17:34:01 webtest1 lrmd: [14078]: info: rsc:nfsdrbd:0:16: promote
> May 20 17:34:02 webtest1 kernel: block drbd0: role( Secondary -> Primary )
> May 20 17:34:02 webtest1 lrmd: [14078]: info: RA output:
> (nfsdrbd:0:promote:stdout)
> May 20 17:34:02 webtest1 crmd: [14081]: info: process_lrm_event: LRM
> operation nfsdrbd:0_promote_0 (call=16, rc=0, cib-update=42, confirmed=true)
> ok
> May 20 17:34:02 webtest1 crmd: [14081]: info: match_graph_event: Action
> nfsdrbd:0_promote_0 (7) confirmed on webtest1. (rc=0)
> May 20 17:34:02 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
> action 21 fired and confirmed
> May 20 17:34:02 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
> action 24 fired and confirmed
> May 20 17:34:02 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
> action 86: notify nfsdrbd:0_post_notify_promote_0 on webtest1. (local)
> May 20 17:34:02 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
> key=86:2:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsdrbd:0_notify_0 )
> May 20 17:34:02 webtest1 lrmd: [14078]: info: rsc:nfsdrbd:0:17: notify
> May 20 17:34:02 webtest1 lrmd: [14078]: info: RA output:
> (nfsdrbd:0:notify:stdout)
> May 20 17:34:02 webtest1 crmd: [14081]: info: process_lrm_event: LRM
> operation nfsdrbd:0_notify_0 (call=17, rc=0, cib-update=43, confirmed=true)
> ok
> May 20 17:34:02 webtest1 crmd: [14081]: info: match_graph_event: Action
> nfsdrbd:0_post_notify_promote_0 (86) confirmed on webtest1. (rc=0)
> May 20 17:34:02 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
> action 25 fired and confirmed
> May 20 17:34:02 webtest1 Filesystem[14438]: INFO: Running start for
> viplbtest.:/nfsdata/web on /usr/local/data
> May 20 17:34:06 webtest1 crmd: [14081]: info: process_lrm_event: LRM
> operation pinggw:0_monitor_10000 (call=13, rc=0, cib-update=44,
> confirmed=false) ok
> May 20 17:34:06 webtest1 crmd: [14081]: info: match_graph_event: Action
> pinggw:0_monitor_10000 (38) confirmed on webtest1. (rc=0)
> May 20 17:34:11 webtest1 attrd: [14079]: info: attrd_trigger_update: Sending
> flush op to all hosts for: pingd (200)
> May 20 17:34:11 webtest1 attrd: [14079]: info: attrd_perform_update: Sent
> update 14: pingd=200
> May 20 17:34:11 webtest1 crmd: [14081]: info: abort_transition_graph:
> te_update_diff:146 - Triggered transition abort (complete=0,
> tag=transient_attributes, id=webtest1., magic=NA, cib=0.407.19) : Transient
> attribute: update
> May 20 17:34:11 webtest1 crmd: [14081]: info: update_abort_priority: Abort
> priority upgraded from 0 to 1000000
> May 20 17:34:11 webtest1 crmd: [14081]: info: update_abort_priority: Abort
> action done superceeded by restart
> May 20 17:34:14 webtest1 lrmd: [14078]: info: RA output:
> (nfsclient:0:start:stderr) mount: mount to NFS server 'viplbtest.' failed:
> System Error: No route to host.
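That mount failure fits the picture: nfsclient:0 tries to mount from the
VIP (VIPlbtest in the status above) before the IP and nfssrv are up, so
the advisory constraint was clearly not honoured in this transition. To
see exactly what the PE decided and in which order, you can replay the
PE input with ptest, something like:

ptest -VV -x /var/lib/pengine/pe-input-NN.bz2

(NN is a placeholder; the actual file name is printed by pengine when it
computes the transition)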