[Pacemaker] advisory ordering question

Andrew Beekhof andrew at beekhof.net
Fri May 28 01:58:16 EDT 2010


On Tue, May 25, 2010 at 3:39 PM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> Hi,
>
> On Thu, May 20, 2010 at 06:09:01PM +0200, Gianluca Cecchi wrote:
>> Hello,
>> the manual for 1.0 (and 1.1) says this about Advisory Ordering:
>>
>> On the other hand, when score="0" is specified for a constraint, the
>> constraint is considered optional and only has an effect when both resources
>> are stopping and/or starting. Any change in state by the first resource will
>> have no effect on the "then" resource.
>>
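
For illustration, a minimal sketch of the two forms in crm shell syntax,
using hypothetical resources rsc1 and rsc2 (not from the config below):

  # mandatory: rsc2 is stopped/started whenever rsc1 changes state
  order rsc2_after_rsc1 inf: rsc1 rsc2
  # advisory: only honoured when both resources start/stop in the same transition
  order rsc2_after_rsc1 0: rsc1 rsc2
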
>> (there is also a link to
>> http://www.clusterlabs.org/mediawiki/images/d/d6/Ordering_Explained.pdf for
>> a deeper look at constraints, but it seems to be broken right now...)
>>
>> Is this also true for an order defined between a group and a clone, rather
>> than between plain resources?
>> Because I have this config
>>
>> order apache_after_nfsd 0: nfs-group apache_clone
>>
>> where
>>
>> group nfs-group lv_drbd0 ClusterIP NfsFS nfssrv \
>> meta target-role="Started"
>>
>> group apache_group nfsclient apache \
>> meta target-role="Started"
>>
>> clone apache_clone apache_group \
>> meta target-role="Started"
>>
>> And when both nodes are up but corosync is stopped on both, and I start
>> corosync on one node, I see in the logs that:
>> - inside nfs-group, lv_drbd0 (the linbit drbd resource) has just been
>> promoted, but the following components (nfssrv in particular) have not
>> started yet
>> - the nfsclient part of apache_clone tries to start, but fails because
>> nfssrv is not in place yet
>>
>> I get the same problem if I change it to
>> order apache_after_nfsd 0: nfssrv apache_clone
>>
>> So I presume the problem could be caused by the fact that the second part
>> is a clone and not a plain resource? Or is it a bug?
>> I can send the whole config if needed.
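
As a point of comparison only, the mandatory form of the same constraint
would be:

  order apache_after_nfsd inf: nfs-group apache_clone

With inf: instead of 0:, the apache_clone instances would not be started
until nfs-group is fully active.
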
>
> Looks like a bug to me. Clone or resource, constraints should be
> observed. Perhaps it's a duplicate of this one:
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=2422

No. That one only applies to interleaved clone-to-clone ordering when
one clone contains a group and the group is partially active.
Quite a specific scenario.
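
To sketch that scenario (hypothetical resource names, not from this
configuration): two interleaved clones ordered against each other, where
the second clone wraps a group:

  clone cl-a rsc-a meta interleave="true"
  clone cl-b grp-b meta interleave="true"
  order b_after_a inf: cl-a cl-b

The bug in question only bites when grp-b is partially active.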

>
>> Setting a non-zero value for the interval parameter of op start for
>> nfsclient doesn't make sense, correct?
>
> Correct.
>
>> What would it do?
>> Would it start the resource every x seconds?
>
> Yes. crmd wouldn't even allow it.
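
For reference, interval belongs on recurring operations such as monitor;
a start op takes only a timeout. A minimal sketch in crm shell syntax
(the params are reconstructed from the logs below; the timeout values
are illustrative, not from the original config):

  primitive nfsclient ocf:heartbeat:Filesystem \
          params device="viplbtest.:/nfsdata/web" directory="/usr/local/data" fstype="nfs" \
          op start interval="0" timeout="60s" \
          op monitor interval="20s" timeout="40s"
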
>
> Thanks,
>
> Dejan
>
>> At the end of the process I have:
>> [root at webtest1 ]# crm_mon -fr1
>> ============
>> Last updated: Thu May 20 17:58:38 2010
>> Stack: openais
>> Current DC: webtest1. - partition WITHOUT quorum
>> Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
>> 2 Nodes configured, 2 expected votes
>> 4 Resources configured.
>> ============
>>
>> Online: [ webtest1. ]
>> OFFLINE: [ webtest2. ]
>>
>> Full list of resources:
>>
>>  Master/Slave Set: NfsData
>>      Masters: [ webtest1. ]
>>      Stopped: [ nfsdrbd:1 ]
>>  Resource Group: nfs-group
>>      lv_nfsdata_drbd    (ocf::heartbeat:LVM):   Started webtest1.
>>      NfsFS      (ocf::heartbeat:Filesystem):    Started webtest1.
>>      VIPlbtest  (ocf::heartbeat:IPaddr2):       Started webtest1.
>>      nfssrv     (ocf::heartbeat:nfsserver):     Started webtest1.
>>  Clone Set: cl-pinggw
>>      Started: [ webtest1. ]
>>      Stopped: [ pinggw:1 ]
>>  Clone Set: apache_clone
>>      Stopped: [ apache_group:0 apache_group:1 ]
>>
>> Migration summary:
>> * Node webtest1.:  pingd=200
>>    nfsclient:0: migration-threshold=1000000 fail-count=1000000
>>
>> Failed actions:
>>     nfsclient:0_start_0 (node=webtest1., call=15, rc=1, status=complete):
>> unknown error
>>
>>
>> Example logs for the second case:
>>
>>
>> May 20 17:33:55 webtest1 pengine: [14080]: info: determine_online_status:
>> Node webtest1. is online
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: clone_print:
>>  Master/Slave Set: NfsData
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: short_print:
>>  Stopped: [ nfsdrbd:0 nfsdrbd:1 ]
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: group_print:  Resource
>> Group: nfs-group
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: native_print:
>>  lv_nfsdata_drbd  (ocf::heartbeat:LVM):   Stopped
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: native_print:      NfsFS
>>    (ocf::heartbeat:Filesystem):    Stopped
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: native_print:
>>  VIPlbtest        (ocf::heartbeat:IPaddr2):       Stopped
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: native_print:      nfssrv
>>   (ocf::heartbeat:nfsserver):     Stopped
>> ...
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: clone_print:  Clone Set:
>> apache_clone
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: short_print:
>>  Stopped: [ apache_group:0 apache_group:1 ]
>> ...
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: LogActions: Start
>> nfsdrbd:0 (webtest1.)
>> ...
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: LogActions: Start
>> nfsclient:0       (webtest1.)
>> May 20 17:33:55 webtest1 pengine: [14080]: notice: LogActions: Start
>> apache:0      (webtest1.)
>> ...
>> May 20 17:33:57 webtest1 kernel: block drbd0: Starting worker thread (from
>> cqueue/0 [68])
>> May 20 17:33:57 webtest1 kernel: block drbd0: disk( Diskless -> Attaching )
>> May 20 17:33:57 webtest1 kernel: block drbd0: Found 4 transactions (7 active
>> extents) in activity log.
>> May 20 17:33:57 webtest1 kernel: block drbd0: Method to ensure write
>> ordering: barrier
>> May 20 17:33:57 webtest1 kernel: block drbd0: max_segment_size ( = BIO size
>> ) = 32768
>> May 20 17:33:57 webtest1 kernel: block drbd0: drbd_bm_resize called with
>> capacity == 8388280
>> May 20 17:33:57 webtest1 kernel: block drbd0: resync bitmap: bits=1048535
>> words=32768
>> May 20 17:33:57 webtest1 kernel: block drbd0: size = 4096 MB (4194140 KB)
>> May 20 17:33:57 webtest1 kernel: block drbd0: recounting of set bits took
>> additional 0 jiffies
>> May 20 17:33:57 webtest1 kernel: block drbd0: 144 KB (36 bits) marked
>> out-of-sync by on disk bit-map.
>> May 20 17:33:57 webtest1 kernel: block drbd0: disk( Attaching -> UpToDate )
>> pdsk( DUnknown -> Outdated )
>> May 20 17:33:57 webtest1 kernel: block drbd0: conn( StandAlone ->
>> Unconnected )
>> May 20 17:33:57 webtest1 kernel: block drbd0: Starting receiver thread (from
>> drbd0_worker [14378])
>> May 20 17:33:57 webtest1 kernel: block drbd0: receiver (re)started
>> May 20 17:33:57 webtest1 kernel: block drbd0: conn( Unconnected ->
>> WFConnection )
>> May 20 17:33:57 webtest1 lrmd: [14078]: info: RA output:
>> (nfsdrbd:0:start:stdout)
>> May 20 17:33:57 webtest1 attrd: [14079]: info: attrd_trigger_update: Sending
>> flush op to all hosts for: master-nfsdrbd:0 (10000)
>> May 20 17:33:57 webtest1 attrd: [14079]: info: attrd_perform_update: Sent
>> update 11: master-nfsdrbd:0=10000
>> May 20 17:33:57 webtest1 crmd: [14081]: info: abort_transition_graph:
>> te_update_diff:146 - Triggered transition abort (complete=0,
>> tag=transient_attributes, id=webtest1., magic=NA, cib=0.407.11) : Transient
>> attribute: update
>> May 20 17:33:57 webtest1 lrmd: [14078]: info: RA output:
>> (nfsdrbd:0:start:stdout)
>> May 20 17:33:57 webtest1 crmd: [14081]: info: process_lrm_event: LRM
>> operation nfsdrbd:0_start_0 (call=10, rc=0, cib-update=37, confirmed=true)
>> ok
>> May 20 17:33:57 webtest1 crmd: [14081]: info: match_graph_event: Action
>> nfsdrbd:0_start_0 (12) confirmed on webtest1. (rc=0)
>> May 20 17:33:57 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
>> action 15 fired and confirmed
>> May 20 17:33:57 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
>> action 18 fired and confirmed
>> May 20 17:33:57 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
>> action 90: notify nfsdrbd:0_post_notify_start_0 on webtest1. (local)
>> May 20 17:33:57 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
>> key=90:1:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsdrbd:0_notify_0 )
>> May 20 17:33:57 webtest1 lrmd: [14078]: info: rsc:nfsdrbd:0:12: notify
>> May 20 17:33:57 webtest1 lrmd: [14078]: info: RA output:
>> (nfsdrbd:0:notify:stdout)
>> ...
>> May 20 17:34:01 webtest1 pengine: [14080]: info: master_color: Promoting
>> nfsdrbd:0 (Slave webtest1.)
>> May 20 17:34:01 webtest1 pengine: [14080]: info: master_color: NfsData:
>> Promoted 1 instances of a possible 1 to master
>> ...
>> May 20 17:34:01 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
>> action 85: notify nfsdrbd:0_pre_notify_promote_0 on webtest1. (local)
>> May 20 17:34:01 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
>> key=85:2:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsdrbd:0_notify_0 )
>> May 20 17:34:01 webtest1 lrmd: [14078]: info: rsc:nfsdrbd:0:14: notify
>> May 20 17:34:01 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
>> action 47 fired and confirmed
>> May 20 17:34:01 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
>> action 43: start nfsclient:0_start_0 on webtest1. (local)
>> May 20 17:34:01 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
>> key=43:2:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsclient:0_start_0 )
>> May 20 17:34:01 webtest1 lrmd: [14078]: info: rsc:nfsclient:0:15: start
>> May 20 17:34:01 webtest1 crmd: [14081]: info: process_lrm_event: LRM
>> operation nfsdrbd:0_notify_0 (call=14, rc=0, cib-update=41, confirmed=true)
>> ok
>> May 20 17:34:01 webtest1 crmd: [14081]: info: match_graph_event: Action
>> nfsdrbd:0_pre_notify_promote_0 (85) confirmed on webtest1. (rc=0)
>> May 20 17:34:01 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
>> action 23 fired and confirmed
>> ...
>> May 20 17:34:01 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
>> action 20 fired and confirmed
>> May 20 17:34:01 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
>> action 7: promote nfsdrbd:0_promote_0 on webtest1. (local)
>> May 20 17:34:01 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
>> key=7:2:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsdrbd:0_promote_0 )
>> May 20 17:34:01 webtest1 lrmd: [14078]: info: rsc:nfsdrbd:0:16: promote
>> May 20 17:34:02 webtest1 kernel: block drbd0: role( Secondary -> Primary )
>> May 20 17:34:02 webtest1 lrmd: [14078]: info: RA output:
>> (nfsdrbd:0:promote:stdout)
>> May 20 17:34:02 webtest1 crmd: [14081]: info: process_lrm_event: LRM
>> operation nfsdrbd:0_promote_0 (call=16, rc=0, cib-update=42, confirmed=true)
>> ok
>> May 20 17:34:02 webtest1 crmd: [14081]: info: match_graph_event: Action
>> nfsdrbd:0_promote_0 (7) confirmed on webtest1. (rc=0)
>> May 20 17:34:02 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
>> action 21 fired and confirmed
>> May 20 17:34:02 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
>> action 24 fired and confirmed
>> May 20 17:34:02 webtest1 crmd: [14081]: info: te_rsc_command: Initiating
>> action 86: notify nfsdrbd:0_post_notify_promote_0 on webtest1. (local)
>> May 20 17:34:02 webtest1 crmd: [14081]: info: do_lrm_rsc_op: Performing
>> key=86:2:0:bf5161a2-5240-4aaf-bc7d-5f54044f5bb6 op=nfsdrbd:0_notify_0 )
>> May 20 17:34:02 webtest1 lrmd: [14078]: info: rsc:nfsdrbd:0:17: notify
>> May 20 17:34:02 webtest1 lrmd: [14078]: info: RA output:
>> (nfsdrbd:0:notify:stdout)
>> May 20 17:34:02 webtest1 crmd: [14081]: info: process_lrm_event: LRM
>> operation nfsdrbd:0_notify_0 (call=17, rc=0, cib-update=43, confirmed=true)
>> ok
>> May 20 17:34:02 webtest1 crmd: [14081]: info: match_graph_event: Action
>> nfsdrbd:0_post_notify_promote_0 (86) confirmed on webtest1. (rc=0)
>> May 20 17:34:02 webtest1 crmd: [14081]: info: te_pseudo_action: Pseudo
>> action 25 fired and confirmed
>> May 20 17:34:02 webtest1 Filesystem[14438]: INFO: Running start for
>> viplbtest.:/nfsdata/web on /usr/local/data
>> May 20 17:34:06 webtest1 crmd: [14081]: info: process_lrm_event: LRM
>> operation pinggw:0_monitor_10000 (call=13, rc=0, cib-update=44,
>> confirmed=false) ok
>> May 20 17:34:06 webtest1 crmd: [14081]: info: match_graph_event: Action
>> pinggw:0_monitor_10000 (38) confirmed on webtest1. (rc=0)
>> May 20 17:34:11 webtest1 attrd: [14079]: info: attrd_trigger_update: Sending
>> flush op to all hosts for: pingd (200)
>> May 20 17:34:11 webtest1 attrd: [14079]: info: attrd_perform_update: Sent
>> update 14: pingd=200
>> May 20 17:34:11 webtest1 crmd: [14081]: info: abort_transition_graph:
>> te_update_diff:146 - Triggered transition abort (complete=0,
>> tag=transient_attributes, id=webtest1., magic=NA, cib=0.407.19) : Transient
>> attribute: update
>> May 20 17:34:11 webtest1 crmd: [14081]: info: update_abort_priority: Abort
>> priority upgraded from 0 to 1000000
>> May 20 17:34:11 webtest1 crmd: [14081]: info: update_abort_priority: Abort
>> action done superceeded by restart
>> May 20 17:34:14 webtest1 lrmd: [14078]: info: RA output:
>> (nfsclient:0:start:stderr) mount: mount to NFS server 'viplbtest.' failed:
>> System Error: No route to host.