[ClusterLabs] Cluster failover failure with Unresolved dependency
Lorand Kelemen
lorand.kelemen at gmail.com
Wed Mar 16 17:20:36 CET 2016
Dear Ken,
I already modified the startup as suggested during testing, thanks! I
swapped the postfix ocf resource for the amavisd systemd resource, since
the latter, as it turns out, also controls postfix startup, and having
both resources in the mail-services group causes conflicts (postfix is
detected as not running).
I am still experiencing the same behaviour: killing amavisd returns rc=7
for the monitor operation on the "victim" node, which sounds logical, but
the logs show the same result: amavisd and virtualip cannot run anywhere.
I made sure systemd is clean (amavisd = inactive, not running instead of
failed) and also reset the failcount on all resources before killing
amavisd.
How can I make sure the resources are in a clean state besides taking the
above actions?
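For reference, this is the cleanup sequence I run before each test — a sketch
only, assuming the pcs/crm_* tools shipped with CentOS 7 and the resource and
node names from this cluster:

```shell
# Clear the failure history and failcounts for the resource on all nodes
pcs resource cleanup amavisd

# Or reset the failcount explicitly per node
crm_failcount --delete --resource amavisd --node mail1
crm_failcount --delete --resource amavisd --node mail2

# Make sure systemd itself sees the unit as cleanly stopped, not failed
systemctl reset-failed amavisd
systemctl is-active amavisd   # expect "inactive" before re-testing
```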
Also note: when causing a filesystem resource to fail (e.g. with unmount),
the failover happens successfully and all resources are started on the
"survivor" node.
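To see why the scheduler concludes the resources cannot run anywhere, the
pe-input files named in the logs can be replayed offline. A sketch, using a
file name taken from the log below and the crm_simulate tool shipped with
pacemaker 1.1 (flags may differ on other versions):

```shell
# Replay a saved scheduler input to see the planned actions
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-331.bz2

# Show the allocation scores that led to "cannot run anywhere"
crm_simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-331.bz2
```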
Best regards,
Lorand
On Wed, Mar 16, 2016 at 4:34 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 03/16/2016 05:49 AM, Lorand Kelemen wrote:
> > Dear Ken,
> >
> > Thanks for the reply! I lowered migration-threshold to 1 and rearranged
> > constraints like you suggested:
> >
> > Location Constraints:
> > Ordering Constraints:
> > promote mail-clone then start fs-services (kind:Mandatory)
> > promote spool-clone then start fs-services (kind:Mandatory)
> > start fs-services then start network-services (kind:Mandatory)
>
> Certainly not a big deal, but I would change the above constraint to
> start fs-services then start mail-services. The IP doesn't care whether
> the filesystems are up yet or not, but postfix does.
>
> > start network-services then start mail-services (kind:Mandatory)
> > Colocation Constraints:
> > fs-services with spool-clone (score:INFINITY) (rsc-role:Started)
> > (with-rsc-role:Master)
> > fs-services with mail-clone (score:INFINITY) (rsc-role:Started)
> > (with-rsc-role:Master)
> > network-services with mail-services (score:INFINITY)
> > mail-services with fs-services (score:INFINITY)
> >
> > Now virtualip and postfix become stopped; I guess these are the relevant
> > lines, but I also attach the full logs:
> >
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_color: Resource postfix cannot run anywhere
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_color: Resource virtualip-1 cannot run anywhere
> >
> > Interesting, will try to play around with ordering - colocation, the
> > solution must be in these settings...
> >
> > Best regards,
> > Lorand
> >
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: --- 0.215.7 2
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: +++ 0.215.8 (null)
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: + /cib: @num_updates=8
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: ++
> >
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']:
> > <lrm_rsc_op id="postfix_last_failure_0"
> > operation_key="postfix_monitor_45000" operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
> > transition-key="86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a"
> > transition-magic="0:7;86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a"
> > on_node="mail1" call-id="1333" rc-code="7"
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info:
> > abort_transition_graph: Transition aborted by postfix_monitor_45000
> > 'create' on mail1: Inactive graph
> > (magic=0:7;86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, cib=0.215.8,
> > source=process_graph_event:598, 1)
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info:
> > update_failcount: Updating failcount for postfix on mail1 after
> failed
> > monitor: rc=7 (update=value++, time=1458124686)
>
> I don't think your constraints are causing problems now; the above
> message indicates that the postfix resource failed. Postfix may not be
> able to run anywhere because it's already failed on both nodes, and the
> IP would be down because it has to be colocated with postfix, and
> postfix can't run.
>
> The rc=7 above indicates that the postfix agent's monitor operation
> returned 7, which is "not running". I'd check the logs for postfix errors.
>
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info:
> > process_graph_event: Detected action (2962.86)
> > postfix_monitor_45000.1333=not running: failed
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_client_update: Expanded fail-count-postfix=value++ to 1
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_process_request: Completed cib_modify operation for section status:
> OK
> > (rc=0, origin=mail1/crmd/253, version=0.215.8)
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_peer_update: Setting fail-count-postfix[mail1]: (null) -> 1 from
> > mail2
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: notice:
> > do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [
> > input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > write_attribute: Sent update 406 with 2 changes for
> > fail-count-postfix, id=<n/a>, set=(null)
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_peer_update: Setting last-failure-postfix[mail1]: 1458124291 ->
> > 1458124686 from mail2
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > write_attribute: Sent update 407 with 2 changes for
> > last-failure-postfix, id=<n/a>, set=(null)
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_process_request: Forwarding cib_modify operation for section status
> to
> > master (origin=local/attrd/406)
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_process_request: Forwarding cib_modify operation for section status
> to
> > master (origin=local/attrd/407)
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: --- 0.215.8 2
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: +++ 0.215.9 (null)
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: + /cib: @num_updates=9
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: ++
> >
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']:
> > <nvpair id="status-1-fail-count-postfix" name="fail-count-postfix"
> > value="1"/>
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_process_request: Completed cib_modify operation for section status:
> OK
> > (rc=0, origin=mail2/attrd/406, version=0.215.9)
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: --- 0.215.9 2
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: +++ 0.215.10 (null)
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: + /cib: @num_updates=10
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: +
> >
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix']:
> > @value=1458124686
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_cib_callback: Update 406 for fail-count-postfix: OK (0)
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_cib_callback: Update 406 for fail-count-postfix[mail1]=1: OK (0)
> > Mar 16 11:38:06 [7415] HWJ-626.domain.local cib: info:
> > cib_process_request: Completed cib_modify operation for section status:
> OK
> > (rc=0, origin=mail2/attrd/407, version=0.215.10)
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_cib_callback: Update 406 for fail-count-postfix[mail2]=(null): OK
> > (0)
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_cib_callback: Update 407 for last-failure-postfix: OK (0)
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_cib_callback: Update 407 for
> > last-failure-postfix[mail1]=1458124686: OK (0)
> > Mar 16 11:38:06 [7418] HWJ-626.domain.local attrd: info:
> > attrd_cib_callback: Update 407 for
> > last-failure-postfix[mail2]=1457610376: OK (0)
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info:
> > abort_transition_graph: Transition aborted by
> > status-1-fail-count-postfix, fail-count-postfix=1: Transient attribute
> > change (create cib=0.215.9, source=abort_unless_down:319,
> >
> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'],
> > 1)
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info:
> > abort_transition_graph: Transition aborted by
> > status-1-last-failure-postfix, last-failure-postfix=1458124686: Transient
> > attribute change (modify cib=0.215.10, source=abort_unless_down:319,
> >
> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix'],
> > 1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice:
> > unpack_config: On loss of CCM Quorum: Ignore
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_online_status: Node mail1 is online
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_online_status: Node mail2 is online
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource mail:0 active in
> > master mode on mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource spool:0 active in
> > master mode on mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource fs-spool active on
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource fs-spool active on
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource fs-mail active on
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource fs-mail active on
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: warning:
> > unpack_rsc_op_failure: Processing failed op monitor for postfix on
> > mail1: not running (7)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource spool:1 active in
> > master mode on mail2
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource mail:1 active in
> > master mode on mail2
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > group_print: Resource Group: network-services
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_print: virtualip-1 (ocf::heartbeat:IPaddr2):
> Started
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > clone_print: Master/Slave Set: spool-clone [spool]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > short_print: Masters: [ mail1 ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > short_print: Slaves: [ mail2 ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > clone_print: Master/Slave Set: mail-clone [mail]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > short_print: Masters: [ mail1 ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > short_print: Slaves: [ mail2 ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > group_print: Resource Group: fs-services
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_print: fs-spool (ocf::heartbeat:Filesystem): Started
> mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_print: fs-mail (ocf::heartbeat:Filesystem): Started
> mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > group_print: Resource Group: mail-services
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_print: postfix (ocf::heartbeat:postfix): FAILED
> mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > master_color: Promoting mail:0 (Master mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > master_color: mail-clone: Promoted 1 instances of a possible 1 to master
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > master_color: Promoting spool:0 (Master mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > master_color: spool-clone: Promoted 1 instances of a possible 1 to master
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > RecurringOp: Start recurring monitor (45s) for postfix on mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave virtualip-1 (Started mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave spool:0 (Master mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave spool:1 (Slave mail2)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave mail:0 (Master mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave mail:1 (Slave mail2)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave fs-spool (Started mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave fs-mail (Started mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice:
> > LogActions: Recover postfix (Started mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice:
> > process_pe_message: Calculated Transition 2963:
> > /var/lib/pacemaker/pengine/pe-input-330.bz2
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info:
> > handle_response: pe_calc calculation pe_calc-dc-1458124686-5541 is
> > obsolete
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice:
> > unpack_config: On loss of CCM Quorum: Ignore
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_online_status: Node mail1 is online
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_online_status: Node mail2 is online
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource mail:0 active in
> > master mode on mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource spool:0 active in
> > master mode on mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource fs-spool active on
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource fs-spool active on
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource fs-mail active on
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource fs-mail active on
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: warning:
> > unpack_rsc_op_failure: Processing failed op monitor for postfix on
> > mail1: not running (7)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource spool:1 active in
> > master mode on mail2
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > determine_op_status: Operation monitor found resource mail:1 active in
> > master mode on mail2
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > group_print: Resource Group: network-services
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_print: virtualip-1 (ocf::heartbeat:IPaddr2):
> Started
> > mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > clone_print: Master/Slave Set: spool-clone [spool]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > short_print: Masters: [ mail1 ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > short_print: Slaves: [ mail2 ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > clone_print: Master/Slave Set: mail-clone [mail]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > short_print: Masters: [ mail1 ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > short_print: Slaves: [ mail2 ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > group_print: Resource Group: fs-services
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_print: fs-spool (ocf::heartbeat:Filesystem): Started
> mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_print: fs-mail (ocf::heartbeat:Filesystem): Started
> mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > group_print: Resource Group: mail-services
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_print: postfix (ocf::heartbeat:postfix): FAILED
> mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > get_failcount_full: postfix has failed 1 times on mail1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: warning:
> > common_apply_stickiness: Forcing postfix away from mail1 after 1
> > failures (max=1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > master_color: Promoting mail:0 (Master mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > master_color: mail-clone: Promoted 1 instances of a possible 1 to master
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > master_color: Promoting spool:0 (Master mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > master_color: spool-clone: Promoted 1 instances of a possible 1 to master
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > rsc_merge_weights: fs-mail: Rolling back scores from postfix
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > rsc_merge_weights: postfix: Rolling back scores from virtualip-1
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_color: Resource postfix cannot run anywhere
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > native_color: Resource virtualip-1 cannot run anywhere
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice:
> > LogActions: Stop virtualip-1 (mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave spool:0 (Master mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave spool:1 (Slave mail2)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave mail:0 (Master mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave mail:1 (Slave mail2)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave fs-spool (Started mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: info:
> > LogActions: Leave fs-mail (Started mail1)
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice:
> > LogActions: Stop postfix (mail1)
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info:
> > do_state_transition: State transition S_POLICY_ENGINE ->
> > S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> > origin=handle_response ]
> > Mar 16 11:38:06 [7419] HWJ-626.domain.local pengine: notice:
> > process_pe_message: Calculated Transition 2964:
> > /var/lib/pacemaker/pengine/pe-input-331.bz2
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: info:
> > do_te_invoke: Processing graph 2964 (ref=pe_calc-dc-1458124686-5542)
> > derived from /var/lib/pacemaker/pengine/pe-input-331.bz2
> > Mar 16 11:38:06 [7420] HWJ-626.domain.local crmd: notice:
> > te_rsc_command: Initiating action 5: stop postfix_stop_0 on mail1
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: --- 0.215.10 2
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: +++ 0.215.11 (null)
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: + /cib: @num_updates=11
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: +
> >
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_0']:
> > @operation_key=postfix_stop_0, @operation=stop,
> > @transition-key=5:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> > @transition-magic=0:0;5:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> > @call-id=1335, @last-run=1458124686, @last-rc-change=1458124686,
> > @exec-time=435
> > Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: info:
> > match_graph_event: Action postfix_stop_0 (5) confirmed on mail1 (rc=0)
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_process_request: Completed cib_modify operation for section status:
> OK
> > (rc=0, origin=mail1/crmd/254, version=0.215.11)
> > Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: notice:
> > te_rsc_command: Initiating action 12: stop virtualip-1_stop_0 on
> mail1
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: --- 0.215.11 2
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: Diff: +++ 0.215.12 (null)
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: + /cib: @num_updates=12
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_perform_op: +
> >
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='virtualip-1']/lrm_rsc_op[@id='virtualip-1_last_0']:
> > @operation_key=virtualip-1_stop_0, @operation=stop,
> > @transition-key=12:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> > @transition-magic=0:0;12:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> > @call-id=1337, @last-run=1458124687, @last-rc-change=1458124687,
> > @exec-time=56
> > Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: info:
> > match_graph_event: Action virtualip-1_stop_0 (12) confirmed on mail1
> > (rc=0)
> > Mar 16 11:38:07 [7415] HWJ-626.domain.local cib: info:
> > cib_process_request: Completed cib_modify operation for section status:
> OK
> > (rc=0, origin=mail1/crmd/255, version=0.215.12)
> > Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: notice:
> > run_graph: Transition 2964 (Complete=7, Pending=0, Fired=0, Skipped=0,
> > Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-331.bz2):
> Complete
> > Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: info: do_log:
> > FSA: Input I_TE_SUCCESS from notify_crmd() received in state
> > S_TRANSITION_ENGINE
> > Mar 16 11:38:07 [7420] HWJ-626.domain.local crmd: notice:
> > do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [
> > input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > Mar 16 11:38:12 [7415] HWJ-626.domain.local cib: info:
> > cib_process_ping: Reporting our current digest to mail2:
> > ed43bc3ecf0f15853900ba49fc514870 for 0.215.12 (0x152b110 0)
> >
> >
> > On Mon, Mar 14, 2016 at 6:44 PM, Ken Gaillot <kgaillot at redhat.com>
> wrote:
> >
> >> On 03/10/2016 09:49 AM, Lorand Kelemen wrote:
> >>> Dear List,
> >>>
> >>> After the creation and testing of a simple 2 node active-passive
> >>> drbd+postfix cluster nearly everything works flawlessly (standby,
> failure
> >>> of a filesystem resource + failover, splitbrain + manual recovery)
> >> however
> >>> when deliberately killing the postfix instance, after reaching the
> >>> migration threshold failover does not occur and resources revert to the
> >>> Stopped state (except the master-slave drbd resource, which works as
> >>> expected).
> >>>
> >>> Ordering and colocation is configured, STONITH and quorum disabled, the
> >>> goal is to always have one node running all the resources and at any
> sign
> >>> of error it should fail over to the passive node, nothing fancy.
> >>>
> >>> Is my configuration wrong or am I hitting a bug?
> >>>
> >>> All software from centos 7 + elrepo repositories.
> >>
> >> With these versions, you can set "two_node: 1" in
> >> /etc/corosync/corosync.conf (which will be done automatically if you
> >> used "pcs cluster setup" initially), and then you don't need to ignore
> >> quorum in pacemaker.
> >>
> >>> Regarding STONITH: the machines are running on free ESXi instances on
> >>> separate machines, so the Vmware fencing agents won't work because in
> the
> >>> free version the API is read-only.
> >>> Still trying to figure out a way to go, until then manual recovery +
> huge
> >>> arp cache times on the upstream firewall...
> >>>
> >>> Please find pe-input*.bz files attached, logs and config below. The
> >>> situation: on node mail1 postfix was killed 3 times (migration
> >> threshold),
> >>> it should have failed over to mail2.
> >>> When killing a filesystem resource three times this happens flawlessly.
> >>>
> >>> Thanks for your input!
> >>>
> >>> Best regards,
> >>> Lorand
> >>>
> >>>
> >>> Cluster Name: mailcluster
> >>> Corosync Nodes:
> >>> mail1 mail2
> >>> Pacemaker Nodes:
> >>> mail1 mail2
> >>>
> >>> Resources:
> >>> Group: network-services
> >>> Resource: virtualip-1 (class=ocf provider=heartbeat type=IPaddr2)
> >>> Attributes: ip=10.20.64.10 cidr_netmask=24 nic=lan0
> >>> Operations: start interval=0s timeout=20s
> >> (virtualip-1-start-interval-0s)
> >>> stop interval=0s timeout=20s
> >> (virtualip-1-stop-interval-0s)
> >>> monitor interval=30s (virtualip-1-monitor-interval-30s)
> >>> Master: spool-clone
> >>> Meta Attrs: master-max=1 master-node-max=1 clone-max=2
> clone-node-max=1
> >>> notify=true
> >>> Resource: spool (class=ocf provider=linbit type=drbd)
> >>> Attributes: drbd_resource=spool
> >>> Operations: start interval=0s timeout=240 (spool-start-interval-0s)
> >>> promote interval=0s timeout=90
> (spool-promote-interval-0s)
> >>> demote interval=0s timeout=90 (spool-demote-interval-0s)
> >>> stop interval=0s timeout=100 (spool-stop-interval-0s)
> >>> monitor interval=10s (spool-monitor-interval-10s)
> >>> Master: mail-clone
> >>> Meta Attrs: master-max=1 master-node-max=1 clone-max=2
> clone-node-max=1
> >>> notify=true
> >>> Resource: mail (class=ocf provider=linbit type=drbd)
> >>> Attributes: drbd_resource=mail
> >>> Operations: start interval=0s timeout=240 (mail-start-interval-0s)
> >>> promote interval=0s timeout=90
> (mail-promote-interval-0s)
> >>> demote interval=0s timeout=90 (mail-demote-interval-0s)
> >>> stop interval=0s timeout=100 (mail-stop-interval-0s)
> >>> monitor interval=10s (mail-monitor-interval-10s)
> >>> Group: fs-services
> >>> Resource: fs-spool (class=ocf provider=heartbeat type=Filesystem)
> >>> Attributes: device=/dev/drbd0 directory=/var/spool/postfix
> fstype=ext4
> >>> options=nodev,nosuid,noexec
> >>> Operations: start interval=0s timeout=60
> (fs-spool-start-interval-0s)
> >>> stop interval=0s timeout=60 (fs-spool-stop-interval-0s)
> >>> monitor interval=20 timeout=40
> >> (fs-spool-monitor-interval-20)
> >>> Resource: fs-mail (class=ocf provider=heartbeat type=Filesystem)
> >>> Attributes: device=/dev/drbd1 directory=/var/spool/mail fstype=ext4
> >>> options=nodev,nosuid,noexec
> >>> Operations: start interval=0s timeout=60 (fs-mail-start-interval-0s)
> >>> stop interval=0s timeout=60 (fs-mail-stop-interval-0s)
> >>> monitor interval=20 timeout=40
> >> (fs-mail-monitor-interval-20)
> >>> Group: mail-services
> >>> Resource: postfix (class=ocf provider=heartbeat type=postfix)
> >>> Operations: start interval=0s timeout=20s
> (postfix-start-interval-0s)
> >>> stop interval=0s timeout=20s (postfix-stop-interval-0s)
> >>> monitor interval=45s (postfix-monitor-interval-45s)
> >>>
> >>> Stonith Devices:
> >>> Fencing Levels:
> >>>
> >>> Location Constraints:
> >>> Ordering Constraints:
> >>> start network-services then promote mail-clone (kind:Mandatory)
> >>> (id:order-network-services-mail-clone-mandatory)
> >>> promote mail-clone then promote spool-clone (kind:Mandatory)
> >>> (id:order-mail-clone-spool-clone-mandatory)
> >>> promote spool-clone then start fs-services (kind:Mandatory)
> >>> (id:order-spool-clone-fs-services-mandatory)
> >>> start fs-services then start mail-services (kind:Mandatory)
> >>> (id:order-fs-services-mail-services-mandatory)
> >>> Colocation Constraints:
> >>> network-services with spool-clone (score:INFINITY) (rsc-role:Started)
> >>> (with-rsc-role:Master)
> >> (id:colocation-network-services-spool-clone-INFINITY)
> >>> network-services with mail-clone (score:INFINITY) (rsc-role:Started)
> >>> (with-rsc-role:Master)
> >> (id:colocation-network-services-mail-clone-INFINITY)
> >>> network-services with fs-services (score:INFINITY)
> >>> (id:colocation-network-services-fs-services-INFINITY)
> >>> network-services with mail-services (score:INFINITY)
> >>> (id:colocation-network-services-mail-services-INFINITY)
> >>
> >> I'm not sure whether it's causing your issue, but I would make the
> >> constraints reflect the logical relationships better.
> >>
> >> For example, network-services only needs to be colocated with
> >> mail-services logically; it's mail-services that needs to be with
> >> fs-services, and fs-services that needs to be with
> >> spool-clone/mail-clone master. In other words, don't make the
> >> highest-level resource depend on everything else, make each level depend
> >> on the level below it.
> >>
> >> Also, I would guess that the virtual IP only needs to be ordered before
> >> mail-services, and mail-clone and spool-clone could both be ordered
> >> before fs-services, rather than ordering mail-clone before spool-clone.
> >>
> >>> Resources Defaults:
> >>> migration-threshold: 3
> >>> Operations Defaults:
> >>> on-fail: restart
> >>>
> >>> Cluster Properties:
> >>> cluster-infrastructure: corosync
> >>> cluster-name: mailcluster
> >>> cluster-recheck-interval: 5min
> >>> dc-version: 1.1.13-10.el7_2.2-44eb2dd
> >>> default-resource-stickiness: infinity
> >>> have-watchdog: false
> >>> last-lrm-refresh: 1457613674
> >>> no-quorum-policy: ignore
> >>> pe-error-series-max: 1024
> >>> pe-input-series-max: 1024
> >>> pe-warn-series-max: 1024
> >>> stonith-enabled: false
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: --- 0.197.15 2
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: +++ 0.197.16 (null)
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: + /cib: @num_updates=16
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: +
> >>>
> >>
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_failure_0']:
> >>> @transition-key=4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @transition-magic=0:7;4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @call-id=1274, @last-rc-change=1457613440
> >>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info:
> >>> abort_transition_graph: Transition aborted by
> postfix_monitor_45000
> >>> 'modify' on mail1: Inactive graph
> >>> (magic=0:7;4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, cib=0.197.16,
> >>> source=process_graph_event:598, 1)
> >>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info:
> >>> update_failcount: Updating failcount for postfix on mail1 after
> >> failed
> >>> monitor: rc=7 (update=value++, time=1457613440)
> >>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_client_update: Expanded fail-count-postfix=value++ to 3
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Completed cib_modify operation for section
> status:
> >> OK
> >>> (rc=0, origin=mail1/crmd/196, version=0.197.16)
> >>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_peer_update: Setting fail-count-postfix[mail1]: 2 -> 3 from
> >> mail2
> >>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info:
> >>> write_attribute: Sent update 400 with 2 changes for
> >>> fail-count-postfix, id=<n/a>, set=(null)
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Forwarding cib_modify operation for section
> status
> >> to
> >>> master (origin=local/attrd/400)
> >>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info:
> >>> process_graph_event: Detected action (1234.4)
> >>> postfix_monitor_45000.1274=not running: failed
> >>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_peer_update: Setting last-failure-postfix[mail1]: 1457613347
> ->
> >>> 1457613440 from mail2
> >>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: notice:
> >>> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [
> >>> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> >>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info:
> >>> write_attribute: Sent update 401 with 2 changes for
> >>> last-failure-postfix, id=<n/a>, set=(null)
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: --- 0.197.16 2
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: +++ 0.197.17 (null)
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: + /cib: @num_updates=17
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: +
> >>>
> >>
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-postfix']:
> >>> @value=3
> >>> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Completed cib_modify operation for section
> status:
> >> OK
> >>> (rc=0, origin=mail2/attrd/400, version=0.197.17)
> >>> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info:
> >>> abort_transition_graph: Transition aborted by
> >>> status-1-fail-count-postfix, fail-count-postfix=3: Transient attribute
> >>> change (modify cib=0.197.17, source=abort_unless_down:319,
> >>>
> >>
> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-postfix'],
> >>> 1)
> >>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_cib_callback: Update 400 for fail-count-postfix: OK (0)
> >>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_cib_callback: Update 400 for fail-count-postfix[mail1]=3: OK
> (0)
> >>> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_cib_callback: Update 400 for fail-count-postfix[mail2]=(null):
> OK
> >>> (0)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Forwarding cib_modify operation for section
> status
> >> to
> >>> master (origin=local/attrd/401)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: --- 0.197.17 2
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: +++ 0.197.18 (null)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: + /cib: @num_updates=18
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: +
> >>>
> >>
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix']:
> >>> @value=1457613440
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Completed cib_modify operation for section
> status:
> >> OK
> >>> (rc=0, origin=mail2/attrd/401, version=0.197.18)
> >>> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_cib_callback: Update 401 for last-failure-postfix: OK (0)
> >>> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_cib_callback: Update 401 for
> >>> last-failure-postfix[mail1]=1457613440: OK (0)
> >>> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info:
> >>> attrd_cib_callback: Update 401 for
> >>> last-failure-postfix[mail2]=1457610376: OK (0)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> >>> abort_transition_graph: Transition aborted by
> >>> status-1-last-failure-postfix, last-failure-postfix=1457613440:
> Transient
> >>> attribute change (modify cib=0.197.18, source=abort_unless_down:319,
> >>>
> >>
> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix'],
> >>> 1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> unpack_config: On loss of CCM Quorum: Ignore
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_online_status: Node mail1 is online
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_online_status: Node mail2 is online
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource mail:0 active in
> >>> master mode on mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource spool:0 active
> in
> >>> master mode on mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource fs-spool active
> on
> >>> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource fs-mail active
> on
> >>> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning:
> >>> unpack_rsc_op_failure: Processing failed op monitor for postfix
> on
> >>> mail1: not running (7)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource spool:1 active
> in
> >>> master mode on mail2
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource mail:1 active in
> >>> master mode on mail2
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> group_print: Resource Group: network-services
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_print: virtualip-1 (ocf::heartbeat:IPaddr2):
> >> Started
> >>> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> clone_print: Master/Slave Set: spool-clone [spool]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> short_print: Masters: [ mail1 ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> short_print: Slaves: [ mail2 ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> clone_print: Master/Slave Set: mail-clone [mail]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> short_print: Masters: [ mail1 ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> short_print: Slaves: [ mail2 ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> group_print: Resource Group: fs-services
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_print: fs-spool (ocf::heartbeat:Filesystem): Started
> >> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_print: fs-mail (ocf::heartbeat:Filesystem): Started
> >> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> group_print: Resource Group: mail-services
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_print: postfix (ocf::heartbeat:postfix): FAILED
> >> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> get_failcount_full: postfix has failed 3 times on mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning:
> >>> common_apply_stickiness: Forcing postfix away from mail1 after 3
> >>> failures (max=3)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> master_color: Promoting mail:0 (Master mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> master_color: mail-clone: Promoted 1 instances of a possible 1 to
> master
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> master_color: Promoting spool:0 (Master mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> master_color: spool-clone: Promoted 1 instances of a possible 1 to
> master
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> rsc_merge_weights: postfix: Rolling back scores from virtualip-1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_color: Resource virtualip-1 cannot run anywhere
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> RecurringOp: Start recurring monitor (45s) for postfix on mail2
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> LogActions: Stop virtualip-1 (mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> LogActions: Leave spool:0 (Master mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> LogActions: Leave spool:1 (Slave mail2)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> LogActions: Leave mail:0 (Master mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> LogActions: Leave mail:1 (Slave mail2)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> LogActions: Stop fs-spool (Started mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> LogActions: Stop fs-mail (Started mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> LogActions: Stop postfix (Started mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> process_pe_message: Calculated Transition 1235:
> >>> /var/lib/pacemaker/pengine/pe-input-302.bz2
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> >>> handle_response: pe_calc calculation pe_calc-dc-1457613441-3756 is
> >>> obsolete
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> unpack_config: On loss of CCM Quorum: Ignore
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_online_status: Node mail1 is online
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_online_status: Node mail2 is online
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource mail:0 active in
> >>> master mode on mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource spool:0 active
> in
> >>> master mode on mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource fs-spool active
> on
> >>> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource fs-mail active
> on
> >>> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning:
> >>> unpack_rsc_op_failure: Processing failed op monitor for postfix
> on
> >>> mail1: not running (7)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource spool:1 active
> in
> >>> master mode on mail2
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> determine_op_status: Operation monitor found resource mail:1 active in
> >>> master mode on mail2
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> group_print: Resource Group: network-services
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_print: virtualip-1 (ocf::heartbeat:IPaddr2):
> >> Started
> >>> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> clone_print: Master/Slave Set: spool-clone [spool]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> short_print: Masters: [ mail1 ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> short_print: Slaves: [ mail2 ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> clone_print: Master/Slave Set: mail-clone [mail]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> short_print: Masters: [ mail1 ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> short_print: Slaves: [ mail2 ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> group_print: Resource Group: fs-services
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_print: fs-spool (ocf::heartbeat:Filesystem): Started
> >> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_print: fs-mail (ocf::heartbeat:Filesystem): Started
> >> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> group_print: Resource Group: mail-services
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_print: postfix (ocf::heartbeat:postfix): FAILED
> >> mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> get_failcount_full: postfix has failed 3 times on mail1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning:
> >>> common_apply_stickiness: Forcing postfix away from mail1 after 3
> >>> failures (max=3)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> master_color: Promoting mail:0 (Master mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> master_color: mail-clone: Promoted 1 instances of a possible 1 to
> master
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> master_color: Promoting spool:0 (Master mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> master_color: spool-clone: Promoted 1 instances of a possible 1 to
> master
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> rsc_merge_weights: postfix: Rolling back scores from virtualip-1
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> native_color: Resource virtualip-1 cannot run anywhere
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> RecurringOp: Start recurring monitor (45s) for postfix on mail2
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> LogActions: Stop virtualip-1 (mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> LogActions: Leave spool:0 (Master mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> LogActions: Leave spool:1 (Slave mail2)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> LogActions: Leave mail:0 (Master mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info:
> >>> LogActions: Leave mail:1 (Slave mail2)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> LogActions: Stop fs-spool (Started mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> LogActions: Stop fs-mail (Started mail1)
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> LogActions: Stop postfix (Started mail1)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> >>> do_state_transition: State transition S_POLICY_ENGINE ->
> >>> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> >>> origin=handle_response ]
> >>> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice:
> >>> process_pe_message: Calculated Transition 1236:
> >>> /var/lib/pacemaker/pengine/pe-input-303.bz2
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> >>> do_te_invoke: Processing graph 1236 (ref=pe_calc-dc-1457613441-3757)
> >>> derived from /var/lib/pacemaker/pengine/pe-input-303.bz2
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> te_rsc_command: Initiating action 12: stop virtualip-1_stop_0 on
> >> mail1
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> te_rsc_command: Initiating action 5: stop postfix_stop_0 on mail1
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: --- 0.197.18 2
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: +++ 0.197.19 (null)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: + /cib: @num_updates=19
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: +
> >>>
> >>
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='virtualip-1']/lrm_rsc_op[@id='virtualip-1_last_0']:
> >>> @operation_key=virtualip-1_stop_0, @operation=stop,
> >>> @transition-key=12:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @transition-magic=0:0;12:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @call-id=1276, @last-run=1457613441, @last-rc-change=1457613441,
> >>> @exec-time=66
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Completed cib_modify operation for section
> status:
> >> OK
> >>> (rc=0, origin=mail1/crmd/197, version=0.197.19)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> >>> match_graph_event: Action virtualip-1_stop_0 (12) confirmed on mail1
> >>> (rc=0)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: --- 0.197.19 2
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: +++ 0.197.20 (null)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: + /cib: @num_updates=20
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: +
> >>>
> >>
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_0']:
> >>> @operation_key=postfix_stop_0, @operation=stop,
> >>> @transition-key=5:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @transition-magic=0:0;5:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @call-id=1278, @last-run=1457613441, @last-rc-change=1457613441,
> >>> @exec-time=476
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> >>> match_graph_event: Action postfix_stop_0 (5) confirmed on mail1
> (rc=0)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> te_rsc_command: Initiating action 79: stop fs-mail_stop_0 on
> mail1
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Completed cib_modify operation for section
> status:
> >> OK
> >>> (rc=0, origin=mail1/crmd/198, version=0.197.20)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: --- 0.197.20 2
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: +++ 0.197.21 (null)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: + /cib: @num_updates=21
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: +
> >>>
> >>
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fs-mail']/lrm_rsc_op[@id='fs-mail_last_0']:
> >>> @operation_key=fs-mail_stop_0, @operation=stop,
> >>> @transition-key=79:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @transition-magic=0:0;79:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @call-id=1280, @last-run=1457613441, @last-rc-change=1457613441,
> >>> @exec-time=88, @queue-time=1
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Completed cib_modify operation for section
> status:
> >> OK
> >>> (rc=0, origin=mail1/crmd/199, version=0.197.21)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> >>> match_graph_event: Action fs-mail_stop_0 (79) confirmed on mail1
> >> (rc=0)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> te_rsc_command: Initiating action 77: stop fs-spool_stop_0 on
> mail1
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: --- 0.197.21 2
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: Diff: +++ 0.197.22 (null)
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: + /cib: @num_updates=22
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_perform_op: +
> >>>
> >>
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fs-spool']/lrm_rsc_op[@id='fs-spool_last_0']:
> >>> @operation_key=fs-spool_stop_0, @operation=stop,
> >>> @transition-key=77:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @transition-magic=0:0;77:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> >>> @call-id=1282, @last-run=1457613441, @last-rc-change=1457613441,
> >>> @exec-time=86
> >>> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_request: Completed cib_modify operation for section
> status:
> >> OK
> >>> (rc=0, origin=mail1/crmd/200, version=0.197.22)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> >>> match_graph_event: Action fs-spool_stop_0 (77) confirmed on mail1
> >> (rc=0)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: warning:
> >>> run_graph: Transition 1236 (Complete=11, Pending=0, Fired=0,
> >> Skipped=0,
> >>> Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-303.bz2):
> >>> Terminated
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: warning:
> >>> te_graph_trigger: Transition failed: terminated
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_graph: Graph 1236 with 12 actions: batch-limit=12 jobs,
> >>> network-delay=0ms
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 16]: Completed pseudo op
> >>> network-services_stopped_0 on N/A (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 15]: Completed pseudo op
> >>> network-services_stop_0 on N/A (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 12]: Completed rsc op
> virtualip-1_stop_0
> >>> on mail1 (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 84]: Completed pseudo op
> >>> fs-services_stopped_0 on N/A (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 83]: Completed pseudo op
> >> fs-services_stop_0
> >>> on N/A (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 77]: Completed rsc op fs-spool_stop_0
> >>> on mail1 (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 79]: Completed rsc op fs-mail_stop_0
> >>> on mail1 (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 90]: Completed pseudo op
> >>> mail-services_stopped_0 on N/A (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 89]: Completed pseudo op
> >>> mail-services_stop_0 on N/A (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 86]: Pending rsc op
> postfix_monitor_45000
> >>> on mail2 (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: * [Input 85]: Unresolved dependency rsc op
> >>> postfix_start_0 on mail2
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 5]: Completed rsc op postfix_stop_0
> >>> on mail1 (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> print_synapse: [Action 8]: Completed pseudo op all_stopped
> >>> on N/A (priority: 0, waiting: none)
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info:
> do_log:
> >>> FSA: Input I_TE_SUCCESS from notify_crmd() received in state
> >>> S_TRANSITION_ENGINE
> >>> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice:
> >>> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [
> >>> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> >>> Mar 10 13:37:26 [7415] HWJ-626.domain.local cib: info:
> >>> cib_process_ping: Reporting our current digest to mail2:
> >>> 3896ee29cdb6ba128330b0ef6e41bd79 for 0.197.22 (0x1544a30 0)
>
>
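[Editor's note: the "Unresolved dependency rsc op postfix_start_0" above follows from the fail count reaching migration-threshold (3), which bans postfix from mail1 while nothing else can satisfy its start dependencies. Before re-testing, the failure history should be cleared. A sketch, assuming the pcs 0.9.x CLI matching the RHEL 7.2 packages in the logs:]

```shell
# Wipe the failed-op history and fail count for postfix on all nodes,
# so the policy engine no longer forces it away from mail1
pcs resource cleanup postfix

# Inspect, and if needed reset, the fail count on a specific node
pcs resource failcount show postfix
pcs resource failcount reset postfix mail1

# Confirm no failed actions remain before the next kill test
pcs status
```

[`pcs resource cleanup` also refreshes the resource's probe state, which covers the poster's question about ensuring a clean state beyond resetting fail counts alone.]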