[ClusterLabs] Cluster failover failure with Unresolved dependency

Ken Gaillot kgaillot at redhat.com
Wed Mar 16 16:34:46 CET 2016


On 03/16/2016 05:49 AM, Lorand Kelemen wrote:
> Dear Ken,
> 
> Thanks for the reply! I lowered migration-threshold to 1 and rearranged the
> constraints as you suggested:
> 
> Location Constraints:
> Ordering Constraints:
>   promote mail-clone then start fs-services (kind:Mandatory)
>   promote spool-clone then start fs-services (kind:Mandatory)
>   start fs-services then start network-services (kind:Mandatory)

Certainly not a big deal, but I would change the above constraint to
"start fs-services then start mail-services". The IP doesn't care whether
the filesystems are up yet or not, but postfix does.
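For reference, that change would look something like this with pcs (a sketch; the constraint id below is a guess based on pcs's usual naming convention, so check `pcs constraint --full` for the actual id on your cluster):

```shell
# List constraints with their ids
pcs constraint --full

# Remove the old "fs-services then network-services" ordering
# (id is an assumption -- use the one printed above)
pcs constraint remove order-fs-services-network-services-mandatory

# Order the filesystems before postfix instead; the IP no longer waits on them
pcs constraint order start fs-services then start mail-services
```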

>   start network-services then start mail-services (kind:Mandatory)
> Colocation Constraints:
>   fs-services with spool-clone (score:INFINITY) (rsc-role:Started)
> (with-rsc-role:Master)
>   fs-services with mail-clone (score:INFINITY) (rsc-role:Started)
> (with-rsc-role:Master)
>   network-services with mail-services (score:INFINITY)
>   mail-services with fs-services (score:INFINITY)
> 
> Now virtualip and postfix become stopped. I guess these are the relevant
> lines, but I attach the full logs as well:
> 
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_color: Resource postfix cannot run anywhere
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_color: Resource virtualip-1 cannot run anywhere
> 
> Interesting; I will try to play around with the ordering and colocation
> settings, the solution must be in there...
> 
> Best regards,
> Lorand
> 
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: --- 0.215.7 2
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: +++ 0.215.8 (null)
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       +  /cib:  @num_updates=8
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       ++
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']:
>  <lrm_rsc_op id="postfix_last_failure_0"
> operation_key="postfix_monitor_45000" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
> transition-key="86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a"
> transition-magic="0:7;86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a"
> on_node="mail1" call-id="1333" rc-code="7"
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:     info:
> abort_transition_graph:       Transition aborted by postfix_monitor_45000
> 'create' on mail1: Inactive graph
> (magic=0:7;86:2962:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, cib=0.215.8,
> source=process_graph_event:598, 1)
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:     info:
> update_failcount:     Updating failcount for postfix on mail1 after failed
> monitor: rc=7 (update=value++, time=1458124686)

I don't think your constraints are causing problems now; the above
message indicates that the postfix resource itself failed. Postfix may not
be able to run anywhere because it has already failed on both nodes, and
the IP would be down because it has to be colocated with postfix, which
can't run.

The rc=7 above indicates that the postfix agent's monitor operation
returned 7, which is "not running". I'd check the logs for postfix errors.
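If it helps with debugging, you can invoke the agent's monitor action by hand to see what it reports, and check/clear the accumulated failcount once postfix is healthy again (a sketch, assuming the stock resource-agents install path):

```shell
# Run the postfix agent's monitor action directly; rc 7 means "not running"
export OCF_ROOT=/usr/lib/ocf
/usr/lib/ocf/resource.d/heartbeat/postfix monitor
echo "monitor rc=$?"

# Inspect the failcount, and clear it after fixing the underlying problem
pcs resource failcount show postfix
pcs resource cleanup postfix
```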

> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:     info:
> process_graph_event:  Detected action (2962.86)
> postfix_monitor_45000.1333=not running: failed
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_client_update:  Expanded fail-count-postfix=value++ to 1
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_process_request:  Completed cib_modify operation for section status: OK
> (rc=0, origin=mail1/crmd/253, version=0.215.8)
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_peer_update:    Setting fail-count-postfix[mail1]: (null) -> 1 from
> mail2
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:   notice:
> do_state_transition:  State transition S_IDLE -> S_POLICY_ENGINE [
> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> write_attribute:      Sent update 406 with 2 changes for
> fail-count-postfix, id=<n/a>, set=(null)
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_peer_update:    Setting last-failure-postfix[mail1]: 1458124291 ->
> 1458124686 from mail2
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> write_attribute:      Sent update 407 with 2 changes for
> last-failure-postfix, id=<n/a>, set=(null)
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_process_request:  Forwarding cib_modify operation for section status to
> master (origin=local/attrd/406)
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_process_request:  Forwarding cib_modify operation for section status to
> master (origin=local/attrd/407)
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: --- 0.215.8 2
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: +++ 0.215.9 (null)
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       +  /cib:  @num_updates=9
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       ++
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']:
>  <nvpair id="status-1-fail-count-postfix" name="fail-count-postfix"
> value="1"/>
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_process_request:  Completed cib_modify operation for section status: OK
> (rc=0, origin=mail2/attrd/406, version=0.215.9)
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: --- 0.215.9 2
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: +++ 0.215.10 (null)
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       +  /cib:  @num_updates=10
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       +
>  /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix']:
>  @value=1458124686
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_cib_callback:   Update 406 for fail-count-postfix: OK (0)
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_cib_callback:   Update 406 for fail-count-postfix[mail1]=1: OK (0)
> Mar 16 11:38:06 [7415] HWJ-626.domain.local        cib:     info:
> cib_process_request:  Completed cib_modify operation for section status: OK
> (rc=0, origin=mail2/attrd/407, version=0.215.10)
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_cib_callback:   Update 406 for fail-count-postfix[mail2]=(null): OK
> (0)
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_cib_callback:   Update 407 for last-failure-postfix: OK (0)
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_cib_callback:   Update 407 for
> last-failure-postfix[mail1]=1458124686: OK (0)
> Mar 16 11:38:06 [7418] HWJ-626.domain.local      attrd:     info:
> attrd_cib_callback:   Update 407 for
> last-failure-postfix[mail2]=1457610376: OK (0)
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:     info:
> abort_transition_graph:       Transition aborted by
> status-1-fail-count-postfix, fail-count-postfix=1: Transient attribute
> change (create cib=0.215.9, source=abort_unless_down:319,
> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'],
> 1)
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:     info:
> abort_transition_graph:       Transition aborted by
> status-1-last-failure-postfix, last-failure-postfix=1458124686: Transient
> attribute change (modify cib=0.215.10, source=abort_unless_down:319,
> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix'],
> 1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:   notice:
> unpack_config:        On loss of CCM Quorum: Ignore
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_online_status:      Node mail1 is online
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_online_status:      Node mail2 is online
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource mail:0 active in
> master mode on mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource spool:0 active in
> master mode on mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource fs-spool active on
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource fs-spool active on
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource fs-mail active on
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource fs-mail active on
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:  warning:
> unpack_rsc_op_failure:        Processing failed op monitor for postfix on
> mail1: not running (7)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource spool:1 active in
> master mode on mail2
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource mail:1 active in
> master mode on mail2
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> group_print:   Resource Group: network-services
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_print:      virtualip-1        (ocf::heartbeat:IPaddr2):     Started
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> clone_print:   Master/Slave Set: spool-clone [spool]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> short_print:       Masters: [ mail1 ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> short_print:       Slaves: [ mail2 ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> clone_print:   Master/Slave Set: mail-clone [mail]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> short_print:       Masters: [ mail1 ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> short_print:       Slaves: [ mail2 ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> group_print:   Resource Group: fs-services
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_print:      fs-spool   (ocf::heartbeat:Filesystem):    Started mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_print:      fs-mail    (ocf::heartbeat:Filesystem):    Started mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> group_print:   Resource Group: mail-services
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_print:      postfix    (ocf::heartbeat:postfix):       FAILED mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> master_color: Promoting mail:0 (Master mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> master_color: mail-clone: Promoted 1 instances of a possible 1 to master
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> master_color: Promoting spool:0 (Master mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> master_color: spool-clone: Promoted 1 instances of a possible 1 to master
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> RecurringOp:   Start recurring monitor (45s) for postfix on mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   virtualip-1     (Started mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   spool:0 (Master mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   spool:1 (Slave mail2)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   mail:0  (Master mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   mail:1  (Slave mail2)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   fs-spool        (Started mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   fs-mail (Started mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:   notice:
> LogActions:   Recover postfix (Started mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:   notice:
> process_pe_message:   Calculated Transition 2963:
> /var/lib/pacemaker/pengine/pe-input-330.bz2
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:     info:
> handle_response:      pe_calc calculation pe_calc-dc-1458124686-5541 is
> obsolete
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:   notice:
> unpack_config:        On loss of CCM Quorum: Ignore
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_online_status:      Node mail1 is online
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_online_status:      Node mail2 is online
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource mail:0 active in
> master mode on mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource spool:0 active in
> master mode on mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource fs-spool active on
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource fs-spool active on
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource fs-mail active on
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource fs-mail active on
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:  warning:
> unpack_rsc_op_failure:        Processing failed op monitor for postfix on
> mail1: not running (7)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource spool:1 active in
> master mode on mail2
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> determine_op_status:  Operation monitor found resource mail:1 active in
> master mode on mail2
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> group_print:   Resource Group: network-services
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_print:      virtualip-1        (ocf::heartbeat:IPaddr2):     Started
> mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> clone_print:   Master/Slave Set: spool-clone [spool]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> short_print:       Masters: [ mail1 ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> short_print:       Slaves: [ mail2 ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> clone_print:   Master/Slave Set: mail-clone [mail]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> short_print:       Masters: [ mail1 ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> short_print:       Slaves: [ mail2 ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> group_print:   Resource Group: fs-services
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_print:      fs-spool   (ocf::heartbeat:Filesystem):    Started mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_print:      fs-mail    (ocf::heartbeat:Filesystem):    Started mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> group_print:   Resource Group: mail-services
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_print:      postfix    (ocf::heartbeat:postfix):       FAILED mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> get_failcount_full:   postfix has failed 1 times on mail1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:  warning:
> common_apply_stickiness:      Forcing postfix away from mail1 after 1
> failures (max=1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> master_color: Promoting mail:0 (Master mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> master_color: mail-clone: Promoted 1 instances of a possible 1 to master
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> master_color: Promoting spool:0 (Master mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> master_color: spool-clone: Promoted 1 instances of a possible 1 to master
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> rsc_merge_weights:    fs-mail: Rolling back scores from postfix
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> rsc_merge_weights:    postfix: Rolling back scores from virtualip-1
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_color: Resource postfix cannot run anywhere
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> native_color: Resource virtualip-1 cannot run anywhere
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:   notice:
> LogActions:   Stop    virtualip-1     (mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   spool:0 (Master mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   spool:1 (Slave mail2)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   mail:0  (Master mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   mail:1  (Slave mail2)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   fs-spool        (Started mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:     info:
> LogActions:   Leave   fs-mail (Started mail1)
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:   notice:
> LogActions:   Stop    postfix (mail1)
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:     info:
> do_state_transition:  State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Mar 16 11:38:06 [7419] HWJ-626.domain.local    pengine:   notice:
> process_pe_message:   Calculated Transition 2964:
> /var/lib/pacemaker/pengine/pe-input-331.bz2
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:     info:
> do_te_invoke: Processing graph 2964 (ref=pe_calc-dc-1458124686-5542)
> derived from /var/lib/pacemaker/pengine/pe-input-331.bz2
> Mar 16 11:38:06 [7420] HWJ-626.domain.local       crmd:   notice:
> te_rsc_command:       Initiating action 5: stop postfix_stop_0 on mail1
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: --- 0.215.10 2
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: +++ 0.215.11 (null)
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       +  /cib:  @num_updates=11
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       +
>  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_0']:
>  @operation_key=postfix_stop_0, @operation=stop,
> @transition-key=5:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> @transition-magic=0:0;5:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> @call-id=1335, @last-run=1458124686, @last-rc-change=1458124686,
> @exec-time=435
> Mar 16 11:38:07 [7420] HWJ-626.domain.local       crmd:     info:
> match_graph_event:    Action postfix_stop_0 (5) confirmed on mail1 (rc=0)
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_process_request:  Completed cib_modify operation for section status: OK
> (rc=0, origin=mail1/crmd/254, version=0.215.11)
> Mar 16 11:38:07 [7420] HWJ-626.domain.local       crmd:   notice:
> te_rsc_command:       Initiating action 12: stop virtualip-1_stop_0 on mail1
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: --- 0.215.11 2
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       Diff: +++ 0.215.12 (null)
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       +  /cib:  @num_updates=12
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_perform_op:       +
>  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='virtualip-1']/lrm_rsc_op[@id='virtualip-1_last_0']:
>  @operation_key=virtualip-1_stop_0, @operation=stop,
> @transition-key=12:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> @transition-magic=0:0;12:2964:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
> @call-id=1337, @last-run=1458124687, @last-rc-change=1458124687,
> @exec-time=56
> Mar 16 11:38:07 [7420] HWJ-626.domain.local       crmd:     info:
> match_graph_event:    Action virtualip-1_stop_0 (12) confirmed on mail1
> (rc=0)
> Mar 16 11:38:07 [7415] HWJ-626.domain.local        cib:     info:
> cib_process_request:  Completed cib_modify operation for section status: OK
> (rc=0, origin=mail1/crmd/255, version=0.215.12)
> Mar 16 11:38:07 [7420] HWJ-626.domain.local       crmd:   notice:
> run_graph:    Transition 2964 (Complete=7, Pending=0, Fired=0, Skipped=0,
> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-331.bz2): Complete
> Mar 16 11:38:07 [7420] HWJ-626.domain.local       crmd:     info: do_log:
>     FSA: Input I_TE_SUCCESS from notify_crmd() received in state
> S_TRANSITION_ENGINE
> Mar 16 11:38:07 [7420] HWJ-626.domain.local       crmd:   notice:
> do_state_transition:  State transition S_TRANSITION_ENGINE -> S_IDLE [
> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Mar 16 11:38:12 [7415] HWJ-626.domain.local        cib:     info:
> cib_process_ping:     Reporting our current digest to mail2:
> ed43bc3ecf0f15853900ba49fc514870 for 0.215.12 (0x152b110 0)
> 
> 
> On Mon, Mar 14, 2016 at 6:44 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> On 03/10/2016 09:49 AM, Lorand Kelemen wrote:
>>> Dear List,
>>>
>>> After the creation and testing of a simple 2-node active-passive
>>> drbd+postfix cluster, nearly everything works flawlessly (standby, failure
>>> of a filesystem resource + failover, split-brain + manual recovery).
>>> However, when deliberately killing the postfix instance, after reaching
>>> the migration threshold failover does not occur and the resources revert
>>> to the Stopped state (except the master-slave drbd resource, which works
>>> as expected).
>>>
>>> Ordering and colocation are configured, STONITH and quorum are disabled;
>>> the goal is to always have one node running all the resources, and at any
>>> sign of error it should fail over to the passive node, nothing fancy.
>>>
>>> Is my configuration wrong or am I hitting a bug?
>>>
>>> All software is from the CentOS 7 + ELRepo repositories.
>>
>> With these versions, you can set "two_node: 1" in
>> /etc/corosync/corosync.conf (which will be done automatically if you
>> used "pcs cluster setup" initially), and then you don't need to ignore
>> quorum in pacemaker.
>>
>>> Regarding STONITH: the machines are running on free ESXi instances on
>>> separate hosts, so the VMware fencing agents won't work because in the
>>> free version the API is read-only.
>>> Still trying to figure out a way forward; until then, manual recovery +
>>> huge ARP cache times on the upstream firewall...
>>>
>>> Please find the pe-input*.bz2 files attached, logs and config below. The
>>> situation: on node mail1, postfix was killed 3 times (the migration
>>> threshold); it should have failed over to mail2.
>>> When killing a filesystem resource three times, this happens flawlessly.
>>>
>>> Thanks for your input!
>>>
>>> Best regards,
>>> Lorand
>>>
>>>
>>> Cluster Name: mailcluster
>>> Corosync Nodes:
>>>  mail1 mail2
>>> Pacemaker Nodes:
>>>  mail1 mail2
>>>
>>> Resources:
>>>  Group: network-services
>>>   Resource: virtualip-1 (class=ocf provider=heartbeat type=IPaddr2)
>>>    Attributes: ip=10.20.64.10 cidr_netmask=24 nic=lan0
>>>    Operations: start interval=0s timeout=20s
>> (virtualip-1-start-interval-0s)
>>>                stop interval=0s timeout=20s
>> (virtualip-1-stop-interval-0s)
>>>                monitor interval=30s (virtualip-1-monitor-interval-30s)
>>>  Master: spool-clone
>>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
>>> notify=true
>>>   Resource: spool (class=ocf provider=linbit type=drbd)
>>>    Attributes: drbd_resource=spool
>>>    Operations: start interval=0s timeout=240 (spool-start-interval-0s)
>>>                promote interval=0s timeout=90 (spool-promote-interval-0s)
>>>                demote interval=0s timeout=90 (spool-demote-interval-0s)
>>>                stop interval=0s timeout=100 (spool-stop-interval-0s)
>>>                monitor interval=10s (spool-monitor-interval-10s)
>>>  Master: mail-clone
>>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
>>> notify=true
>>>   Resource: mail (class=ocf provider=linbit type=drbd)
>>>    Attributes: drbd_resource=mail
>>>    Operations: start interval=0s timeout=240 (mail-start-interval-0s)
>>>                promote interval=0s timeout=90 (mail-promote-interval-0s)
>>>                demote interval=0s timeout=90 (mail-demote-interval-0s)
>>>                stop interval=0s timeout=100 (mail-stop-interval-0s)
>>>                monitor interval=10s (mail-monitor-interval-10s)
>>>  Group: fs-services
>>>   Resource: fs-spool (class=ocf provider=heartbeat type=Filesystem)
>>>    Attributes: device=/dev/drbd0 directory=/var/spool/postfix fstype=ext4
>>> options=nodev,nosuid,noexec
>>>    Operations: start interval=0s timeout=60 (fs-spool-start-interval-0s)
>>>                stop interval=0s timeout=60 (fs-spool-stop-interval-0s)
>>>                monitor interval=20 timeout=40
>> (fs-spool-monitor-interval-20)
>>>   Resource: fs-mail (class=ocf provider=heartbeat type=Filesystem)
>>>    Attributes: device=/dev/drbd1 directory=/var/spool/mail fstype=ext4
>>> options=nodev,nosuid,noexec
>>>    Operations: start interval=0s timeout=60 (fs-mail-start-interval-0s)
>>>                stop interval=0s timeout=60 (fs-mail-stop-interval-0s)
>>>                monitor interval=20 timeout=40
>> (fs-mail-monitor-interval-20)
>>>  Group: mail-services
>>>   Resource: postfix (class=ocf provider=heartbeat type=postfix)
>>>    Operations: start interval=0s timeout=20s (postfix-start-interval-0s)
>>>                stop interval=0s timeout=20s (postfix-stop-interval-0s)
>>>                monitor interval=45s (postfix-monitor-interval-45s)
>>>
>>> Stonith Devices:
>>> Fencing Levels:
>>>
>>> Location Constraints:
>>> Ordering Constraints:
>>>   start network-services then promote mail-clone (kind:Mandatory)
>>> (id:order-network-services-mail-clone-mandatory)
>>>   promote mail-clone then promote spool-clone (kind:Mandatory)
>>> (id:order-mail-clone-spool-clone-mandatory)
>>>   promote spool-clone then start fs-services (kind:Mandatory)
>>> (id:order-spool-clone-fs-services-mandatory)
>>>   start fs-services then start mail-services (kind:Mandatory)
>>> (id:order-fs-services-mail-services-mandatory)
>>> Colocation Constraints:
>>>   network-services with spool-clone (score:INFINITY) (rsc-role:Started)
>>> (with-rsc-role:Master)
>> (id:colocation-network-services-spool-clone-INFINITY)
>>>   network-services with mail-clone (score:INFINITY) (rsc-role:Started)
>>> (with-rsc-role:Master)
>> (id:colocation-network-services-mail-clone-INFINITY)
>>>   network-services with fs-services (score:INFINITY)
>>> (id:colocation-network-services-fs-services-INFINITY)
>>>   network-services with mail-services (score:INFINITY)
>>> (id:colocation-network-services-mail-services-INFINITY)
>>
>> I'm not sure whether it's causing your issue, but I would make the
>> constraints reflect the logical relationships better.
>>
>> For example, network-services only needs to be colocated with
>> mail-services logically; it's mail-services that needs to be with
>> fs-services, and fs-services that needs to be with
>> spool-clone/mail-clone master. In other words, don't make the
>> highest-level resource depend on everything else, make each level depend
>> on the level below it.
>>
>> Also, I would guess that the virtual IP only needs to be ordered before
>> mail-services, and mail-clone and spool-clone could both be ordered
>> before fs-services, rather than ordering mail-clone before spool-clone.
>>
>>> Resources Defaults:
>>>  migration-threshold: 3
>>> Operations Defaults:
>>>  on-fail: restart
>>>
>>> Cluster Properties:
>>>  cluster-infrastructure: corosync
>>>  cluster-name: mailcluster
>>>  cluster-recheck-interval: 5min
>>>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>>>  default-resource-stickiness: infinity
>>>  have-watchdog: false
>>>  last-lrm-refresh: 1457613674
>>>  no-quorum-policy: ignore
>>>  pe-error-series-max: 1024
>>>  pe-input-series-max: 1024
>>>  pe-warn-series-max: 1024
>>>  stonith-enabled: false
>>>
>>>
>>>
>>>
>>>
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: --- 0.197.15 2
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: +++ 0.197.16 (null)
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +  /cib:  @num_updates=16
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +
>>>
>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_failure_0']:
>>>  @transition-key=4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @transition-magic=0:7;4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @call-id=1274, @last-rc-change=1457613440
>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local       crmd:     info:
>>> abort_transition_graph:       Transition aborted by postfix_monitor_45000
>>> 'modify' on mail1: Inactive graph
>>> (magic=0:7;4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, cib=0.197.16,
>>> source=process_graph_event:598, 1)
>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local       crmd:     info:
>>> update_failcount:     Updating failcount for postfix on mail1 after
>> failed
>>> monitor: rc=7 (update=value++, time=1457613440)
>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_client_update:  Expanded fail-count-postfix=value++ to 3
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Completed cib_modify operation for section status: OK
>>> (rc=0, origin=mail1/crmd/196, version=0.197.16)
>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_peer_update:    Setting fail-count-postfix[mail1]: 2 -> 3 from mail2
>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local      attrd:     info:
>>> write_attribute:      Sent update 400 with 2 changes for
>>> fail-count-postfix, id=<n/a>, set=(null)
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Forwarding cib_modify operation for section status to
>>> master (origin=local/attrd/400)
>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local       crmd:     info:
>>> process_graph_event:  Detected action (1234.4)
>>> postfix_monitor_45000.1274=not running: failed
>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_peer_update:    Setting last-failure-postfix[mail1]: 1457613347 ->
>>> 1457613440 from mail2
>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local       crmd:   notice:
>>> do_state_transition:  State transition S_IDLE -> S_POLICY_ENGINE [
>>> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local      attrd:     info:
>>> write_attribute:      Sent update 401 with 2 changes for
>>> last-failure-postfix, id=<n/a>, set=(null)
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: --- 0.197.16 2
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: +++ 0.197.17 (null)
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +  /cib:  @num_updates=17
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +
>>>
>>> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-postfix']:
>>>  @value=3
>>> Mar 10 13:37:20 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Completed cib_modify operation for section status: OK
>>> (rc=0, origin=mail2/attrd/400, version=0.197.17)
>>> Mar 10 13:37:20 [7420] HWJ-626.domain.local       crmd:     info:
>>> abort_transition_graph:       Transition aborted by
>>> status-1-fail-count-postfix, fail-count-postfix=3: Transient attribute
>>> change (modify cib=0.197.17, source=abort_unless_down:319,
>>>
>>> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-postfix'],
>>> 1)
>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_cib_callback:   Update 400 for fail-count-postfix: OK (0)
>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_cib_callback:   Update 400 for fail-count-postfix[mail1]=3: OK (0)
>>> Mar 10 13:37:20 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_cib_callback:   Update 400 for fail-count-postfix[mail2]=(null): OK
>>> (0)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Forwarding cib_modify operation for section status to
>>> master (origin=local/attrd/401)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: --- 0.197.17 2
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: +++ 0.197.18 (null)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +  /cib:  @num_updates=18
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +
>>>
>>> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix']:
>>>  @value=1457613440
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Completed cib_modify operation for section status: OK
>>> (rc=0, origin=mail2/attrd/401, version=0.197.18)
>>> Mar 10 13:37:21 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_cib_callback:   Update 401 for last-failure-postfix: OK (0)
>>> Mar 10 13:37:21 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_cib_callback:   Update 401 for
>>> last-failure-postfix[mail1]=1457613440: OK (0)
>>> Mar 10 13:37:21 [7418] HWJ-626.domain.local      attrd:     info:
>>> attrd_cib_callback:   Update 401 for
>>> last-failure-postfix[mail2]=1457610376: OK (0)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info:
>>> abort_transition_graph:       Transition aborted by
>>> status-1-last-failure-postfix, last-failure-postfix=1457613440: Transient
>>> attribute change (modify cib=0.197.18, source=abort_unless_down:319,
>>>
>>> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix'],
>>> 1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> unpack_config:        On loss of CCM Quorum: Ignore
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_online_status:      Node mail1 is online
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_online_status:      Node mail2 is online
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource mail:0 active in
>>> master mode on mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource spool:0 active in
>>> master mode on mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource fs-spool active on
>>> mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource fs-mail active on
>>> mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:  warning:
>>> unpack_rsc_op_failure:        Processing failed op monitor for postfix on
>>> mail1: not running (7)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource spool:1 active in
>>> master mode on mail2
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource mail:1 active in
>>> master mode on mail2
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> group_print:   Resource Group: network-services
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_print:      virtualip-1        (ocf::heartbeat:IPaddr2):       Started
>>> mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> clone_print:   Master/Slave Set: spool-clone [spool]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> short_print:       Masters: [ mail1 ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> short_print:       Slaves: [ mail2 ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> clone_print:   Master/Slave Set: mail-clone [mail]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> short_print:       Masters: [ mail1 ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> short_print:       Slaves: [ mail2 ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> group_print:   Resource Group: fs-services
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_print:      fs-spool   (ocf::heartbeat:Filesystem):    Started mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_print:      fs-mail    (ocf::heartbeat:Filesystem):    Started mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> group_print:   Resource Group: mail-services
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_print:      postfix    (ocf::heartbeat:postfix):       FAILED mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> get_failcount_full:   postfix has failed 3 times on mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:  warning:
>>> common_apply_stickiness:      Forcing postfix away from mail1 after 3
>>> failures (max=3)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> master_color: Promoting mail:0 (Master mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> master_color: mail-clone: Promoted 1 instances of a possible 1 to master
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> master_color: Promoting spool:0 (Master mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> master_color: spool-clone: Promoted 1 instances of a possible 1 to master
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> rsc_merge_weights:    postfix: Rolling back scores from virtualip-1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_color: Resource virtualip-1 cannot run anywhere
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> RecurringOp:   Start recurring monitor (45s) for postfix on mail2
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> LogActions:   Stop    virtualip-1     (mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> LogActions:   Leave   spool:0 (Master mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> LogActions:   Leave   spool:1 (Slave mail2)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> LogActions:   Leave   mail:0  (Master mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> LogActions:   Leave   mail:1  (Slave mail2)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> LogActions:   Stop    fs-spool        (Started mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> LogActions:   Stop    fs-mail (Started mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> LogActions:   Stop    postfix (Started mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> process_pe_message:   Calculated Transition 1235:
>>> /var/lib/pacemaker/pengine/pe-input-302.bz2
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info:
>>> handle_response:      pe_calc calculation pe_calc-dc-1457613441-3756 is
>>> obsolete
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> unpack_config:        On loss of CCM Quorum: Ignore
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_online_status:      Node mail1 is online
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_online_status:      Node mail2 is online
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource mail:0 active in
>>> master mode on mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource spool:0 active in
>>> master mode on mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource fs-spool active on
>>> mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource fs-mail active on
>>> mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:  warning:
>>> unpack_rsc_op_failure:        Processing failed op monitor for postfix on
>>> mail1: not running (7)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource spool:1 active in
>>> master mode on mail2
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> determine_op_status:  Operation monitor found resource mail:1 active in
>>> master mode on mail2
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> group_print:   Resource Group: network-services
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_print:      virtualip-1        (ocf::heartbeat:IPaddr2):       Started
>>> mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> clone_print:   Master/Slave Set: spool-clone [spool]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> short_print:       Masters: [ mail1 ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> short_print:       Slaves: [ mail2 ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> clone_print:   Master/Slave Set: mail-clone [mail]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> short_print:       Masters: [ mail1 ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> short_print:       Slaves: [ mail2 ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> group_print:   Resource Group: fs-services
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_print:      fs-spool   (ocf::heartbeat:Filesystem):    Started mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_print:      fs-mail    (ocf::heartbeat:Filesystem):    Started mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> group_print:   Resource Group: mail-services
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_print:      postfix    (ocf::heartbeat:postfix):       FAILED mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> get_failcount_full:   postfix has failed 3 times on mail1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:  warning:
>>> common_apply_stickiness:      Forcing postfix away from mail1 after 3
>>> failures (max=3)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> master_color: Promoting mail:0 (Master mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> master_color: mail-clone: Promoted 1 instances of a possible 1 to master
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> master_color: Promoting spool:0 (Master mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> master_color: spool-clone: Promoted 1 instances of a possible 1 to master
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> rsc_merge_weights:    postfix: Rolling back scores from virtualip-1
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> native_color: Resource virtualip-1 cannot run anywhere
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> RecurringOp:   Start recurring monitor (45s) for postfix on mail2
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> LogActions:   Stop    virtualip-1     (mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> LogActions:   Leave   spool:0 (Master mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> LogActions:   Leave   spool:1 (Slave mail2)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> LogActions:   Leave   mail:0  (Master mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:     info:
>>> LogActions:   Leave   mail:1  (Slave mail2)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> LogActions:   Stop    fs-spool        (Started mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> LogActions:   Stop    fs-mail (Started mail1)
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> LogActions:   Stop    postfix (Started mail1)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info:
>>> do_state_transition:  State transition S_POLICY_ENGINE ->
>>> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
>>> origin=handle_response ]
>>> Mar 10 13:37:21 [7419] HWJ-626.domain.local    pengine:   notice:
>>> process_pe_message:   Calculated Transition 1236:
>>> /var/lib/pacemaker/pengine/pe-input-303.bz2
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info:
>>> do_te_invoke: Processing graph 1236 (ref=pe_calc-dc-1457613441-3757)
>>> derived from /var/lib/pacemaker/pengine/pe-input-303.bz2
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> te_rsc_command:       Initiating action 12: stop virtualip-1_stop_0 on mail1
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> te_rsc_command:       Initiating action 5: stop postfix_stop_0 on mail1
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: --- 0.197.18 2
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: +++ 0.197.19 (null)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +  /cib:  @num_updates=19
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +
>>>
>>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='virtualip-1']/lrm_rsc_op[@id='virtualip-1_last_0']:
>>>  @operation_key=virtualip-1_stop_0, @operation=stop,
>>> @transition-key=12:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @transition-magic=0:0;12:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @call-id=1276, @last-run=1457613441, @last-rc-change=1457613441,
>>> @exec-time=66
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Completed cib_modify operation for section status: OK
>>> (rc=0, origin=mail1/crmd/197, version=0.197.19)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info:
>>> match_graph_event:    Action virtualip-1_stop_0 (12) confirmed on mail1
>>> (rc=0)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: --- 0.197.19 2
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: +++ 0.197.20 (null)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +  /cib:  @num_updates=20
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +
>>>
>>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_0']:
>>>  @operation_key=postfix_stop_0, @operation=stop,
>>> @transition-key=5:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @transition-magic=0:0;5:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @call-id=1278, @last-run=1457613441, @last-rc-change=1457613441,
>>> @exec-time=476
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info:
>>> match_graph_event:    Action postfix_stop_0 (5) confirmed on mail1 (rc=0)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> te_rsc_command:       Initiating action 79: stop fs-mail_stop_0 on mail1
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Completed cib_modify operation for section status: OK
>>> (rc=0, origin=mail1/crmd/198, version=0.197.20)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: --- 0.197.20 2
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: +++ 0.197.21 (null)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +  /cib:  @num_updates=21
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +
>>>
>>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fs-mail']/lrm_rsc_op[@id='fs-mail_last_0']:
>>>  @operation_key=fs-mail_stop_0, @operation=stop,
>>> @transition-key=79:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @transition-magic=0:0;79:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @call-id=1280, @last-run=1457613441, @last-rc-change=1457613441,
>>> @exec-time=88, @queue-time=1
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Completed cib_modify operation for section status: OK
>>> (rc=0, origin=mail1/crmd/199, version=0.197.21)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info:
>>> match_graph_event:    Action fs-mail_stop_0 (79) confirmed on mail1 (rc=0)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> te_rsc_command:       Initiating action 77: stop fs-spool_stop_0 on mail1
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: --- 0.197.21 2
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       Diff: +++ 0.197.22 (null)
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +  /cib:  @num_updates=22
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_perform_op:       +
>>>
>>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fs-spool']/lrm_rsc_op[@id='fs-spool_last_0']:
>>>  @operation_key=fs-spool_stop_0, @operation=stop,
>>> @transition-key=77:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @transition-magic=0:0;77:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a,
>>> @call-id=1282, @last-run=1457613441, @last-rc-change=1457613441,
>>> @exec-time=86
>>> Mar 10 13:37:21 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_request:  Completed cib_modify operation for section status: OK
>>> (rc=0, origin=mail1/crmd/200, version=0.197.22)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info:
>>> match_graph_event:    Action fs-spool_stop_0 (77) confirmed on mail1 (rc=0)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:  warning:
>>> run_graph:    Transition 1236 (Complete=11, Pending=0, Fired=0, Skipped=0,
>>> Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-303.bz2):
>>> Terminated
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:  warning:
>>> te_graph_trigger:     Transition failed: terminated
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_graph:  Graph 1236 with 12 actions: batch-limit=12 jobs,
>>> network-delay=0ms
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   16]: Completed pseudo op
>>> network-services_stopped_0     on N/A (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   15]: Completed pseudo op
>>> network-services_stop_0        on N/A (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   12]: Completed rsc op virtualip-1_stop_0
>>>              on mail1 (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   84]: Completed pseudo op
>>> fs-services_stopped_0          on N/A (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   83]: Completed pseudo op fs-services_stop_0
>>>             on N/A (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   77]: Completed rsc op fs-spool_stop_0
>>>             on mail1 (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   79]: Completed rsc op fs-mail_stop_0
>>>              on mail1 (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   90]: Completed pseudo op
>>> mail-services_stopped_0        on N/A (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   89]: Completed pseudo op
>>> mail-services_stop_0           on N/A (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action   86]: Pending rsc op postfix_monitor_45000
>>>             on mail2 (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:         * [Input 85]: Unresolved dependency rsc op
>>> postfix_start_0 on mail2
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action    5]: Completed rsc op postfix_stop_0
>>>              on mail1 (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> print_synapse:        [Action    8]: Completed pseudo op all_stopped
>>>              on N/A (priority: 0, waiting: none)
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:     info: do_log:
>>>     FSA: Input I_TE_SUCCESS from notify_crmd() received in state
>>> S_TRANSITION_ENGINE
>>> Mar 10 13:37:21 [7420] HWJ-626.domain.local       crmd:   notice:
>>> do_state_transition:  State transition S_TRANSITION_ENGINE -> S_IDLE [
>>> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>>> Mar 10 13:37:26 [7415] HWJ-626.domain.local        cib:     info:
>>> cib_process_ping:     Reporting our current digest to mail2:
>>> 3896ee29cdb6ba128330b0ef6e41bd79 for 0.197.22 (0x1544a30 0)
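For anyone replaying these logs: the "Forcing postfix away from mail1 after 3 failures (max=3)" decision above is driven by the accumulated fail-count attribute, so once the underlying problem is fixed the resource stays banned until that history is cleared. With pcs, the usual commands to inspect and reset it look like this (a sketch, assuming the pcs shipped with this cluster):

```shell
# Show the fail counts that are forcing postfix away from mail1
pcs resource failcount show postfix

# Clear postfix's failure history so it (and everything colocated with
# it, such as the virtual IP) becomes placeable again
pcs resource cleanup postfix

# Or reset the counter explicitly for a single node
pcs resource failcount reset postfix mail1
```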