[Pacemaker] Master/Slave not failing over

Eliot Gable egable at broadvox.com
Mon Jun 28 19:05:52 EDT 2010


As an update on this: I modified my monitoring action so that it now determines master/slave status at the moment the action triggers. That fixed the light monitoring for every case I have been able to test so far, but the medium monitoring still has a problem. The medium monitor runs on an 11,000 ms interval while the resource is a slave and a 7,000 ms interval while it is a master. I am hitting cases where a medium monitor starts on the 11,000 ms interval at nearly the same time a promote occurs. The monitoring task is suspended or delayed just long enough for the promote to run code that causes the monitor to flag the state as MASTER. Since the monitor was launched while the resource was a slave, the cluster expects it to return slave status; it returns master status instead, so the node is marked as failed and the system fails over again. Determining the state of the system (master or slave) is the very first thing I do in the monitoring section, yet it can still be delayed until after the promote.

Jun 28 09:50:28 node-2 lrmd: [2934]: info: perform_op:2873: operation monitor[407] on ocf::frs::frs:0 for client 2937, its parameters: profile_start_delay=[45] CRM_meta_notify_active_resource=[ ] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_inactive_resource=[frs:0 ] ips=[ ] CRM_meta_notify_start_resource=[frs:0 ] OCF_CHECK_LEVEL=[10] CRM_meta_notify_active_uname=[ ] fsroot=[] managetype=[screen] CRM_meta_globally_unique=[false] stop_delay=[1 for rsc is already running.
Jun 28 09:50:28 node-2 lrmd: [2934]: info: perform_op:2883: postponing all ops on resource frs:0 by 1000 ms
Jun 28 09:50:29 node-2 frs[907]: INFO: Monitor medium returned: OCF_SUCCESS for SLAVE status
Jun 28 09:50:29 node-2 frs[1262]: INFO: Light monitoring of the frs system completed: RUNNING_SLAVE
Jun 28 09:50:32 node-2 crm_resource: [1554]: info: Invoked: crm_resource --meta -r frs-MS -p target-role -v Master
Jun 28 09:50:33 node-2 lrmd: [2934]: info: rsc:frs:0:533: promote
<snip>
...
</snip>
Jun 28 09:50:42 node-2 frs[2567]: INFO: Monitor medium returned: OCF_RUNNING_MASTER for MASTER status
Jun 28 09:50:42 node-2 crmd: [2937]: info: process_lrm_event: LRM operation frs:0_monitor_11000 (call=407, rc=8, cib-update=694, confirmed=false) master
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_local_callback: Expanded fail-count-frs:0=value++ to 19
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-frs:0 (19)
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 1366: fail-count-frs:0=19
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-frs:0 (1277736969)
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 1368: last-failure-frs:0=1277736969
Jun 28 09:50:42 node-2 frs[2991]: INFO: Light monitoring of the frs system completed: RUNNING_MASTER

The only real fix I can see is to alter the way lrmd/crmd work so that the returned status is checked against the current expected state of the cluster, rather than the state that existed when the monitor action was launched. Does anyone else have suggestions I can try?
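For reference, the ordering change I made to the monitor action looks roughly like the sketch below. It only illustrates the ordering; frs_current_role and frs_deep_checks are placeholder names, not the agent's real functions:

    frs_monitor() {
        # Record the role before any of the slower checks run, so a promote
        # that fires mid-monitor cannot change what this invocation reports.
        local role
        role=$(frs_current_role)        # prints "master" or "slave"

        # The slower, OCF_CHECK_LEVEL-dependent checks happen afterwards.
        frs_deep_checks || return $OCF_ERR_GENERIC

        if [ "$role" = "master" ]; then
            return $OCF_RUNNING_MASTER
        fi
        return $OCF_SUCCESS
    }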



Eliot Gable
Senior Product Developer
1228 Euclid Ave, Suite 390
Cleveland, OH 44115

Direct: 216-373-4808
Fax: 216-373-4657
egable at broadvox.net


CONFIDENTIAL COMMUNICATION.  This e-mail and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom it is addressed. If you are not the intended recipient, please call me immediately.  BROADVOX is a registered trademark of Broadvox, LLC.


-----Original Message-----
From: Eliot Gable [mailto:egable at broadvox.com]
Sent: Friday, June 25, 2010 3:14 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

I modified my resource to set "migration-threshold=1" and "failure-timeout=5s". Now the resource is finally switching to Master on the slave node when the original master fails. However, shortly after it switches to Master, it reports FAILED_MASTER and fails back over. Looking at the logs, I see the following events take place:

A monitor medium action is started on node-1 as a Slave
Demote occurs on node-2 as a Master
Stop occurs on node-2 as a Master
Promote occurs on node-1
Promote finishes on node-1 and it is now a Master
The monitor medium action finishes and returns RUNNING_MASTER, even though it was started as a slave. It gets reported as a monitoring failure, and node-1 gets marked as FAILED_MASTER.

Here is the tail end of these actions:

Jun 25 11:06:34 node-1 frs[29700]: INFO: Light monitoring of the frs system completed: RUNNING_MASTER
Jun 25 11:06:34 node-1 crmd: [3005]: info: process_lrm_event: LRM operation frs:1_monitor_11000 (call=396, rc=8, cib-update=1146, confirmed=false) master
Jun 25 11:06:34 node-1 crmd: [3005]: info: process_graph_event: Detected action frs:1_monitor_11000 from a different transition: 185 vs. 255
Jun 25 11:06:34 node-1 crmd: [3005]: info: abort_transition_graph: process_graph_event:462 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=frs:1_monitor_11000, magic=0:8;49:185:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.476.273) : Old event
Jun 25 11:06:34 node-1 crmd: [3005]: WARN: update_failcount: Updating failcount for frs:1 on node-1 after failed monitor: rc=8 (update=value++, time=1277478394)
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_local_callback: Expanded fail-count-frs:1=value++ to 6
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-frs:1 (6)
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_perform_update: Sent update 1790: fail-count-frs:1=6
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-frs:1 (1277478394)
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_perform_update: Sent update 1792: last-failure-frs:1=1277478394

The odd part is that the system clearly recognizes the event as belonging to a previous transition and aborts it, yet it still counts it as a failure.

I can probably fix this in my resource agent by determining the current expected state of the node (master or slave) right at the start and only then performing the other monitoring checks. Since those other checks take some time to complete, they currently delay the determination of master/slave status until after the node has switched to Master. That said, the agent is arguably doing the right thing at the moment, because it reports the state as it exists at the time the monitor action returns.




-----Original Message-----
From: Eliot Gable [mailto:egable at broadvox.com]
Sent: Friday, June 25, 2010 1:45 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

When I issue 'ip addr flush eth1' on the Master node (node-2), my resource agent detects the failure of the network resources it monitors. I then get alerts from my RA for the following actions:

frs ERROR on node-2 at Fri Jun 25 08:14:44 2010 EDT: 192.168.3.4/24 is in a failed state on eth1 on Master node.

frs Master Node Failure Alert on node-2 at Fri Jun 25 08:14:44 2010 EDT: IP Address 192.168.3.4/24 no longer exists on eth1. Furthermore, eth1 has ARP enabled, so it looks like node-2 is supposed to be the master node.

frs Demote Action Taken on node-2 at Fri Jun 25 08:14:46 2010 EDT
frs Demote Completed Successfully on node-2 at Fri Jun 25 08:14:46 2010 EDT
frs Stop Action Taken on node-2 at Fri Jun 25 08:14:46 2010 EDT
frs Stop Completed Successfully on node-2 at Fri Jun 25 08:14:51 2010 EDT
frs Start Action Taken on node-2 at Fri Jun 25 08:14:52 2010 EDT
frs Start Completed Successfully on node-2 at Fri Jun 25 08:15:19 2010 EDT
frs Promote Action Taken on node-2 at Fri Jun 25 08:15:20 2010 EDT
frs Promote Completed Successfully on node-2 at Fri Jun 25 08:15:21 2010 EDT

During the failure, if I dump the CIB, I see this:

      <transient_attributes id="node-2">
        <instance_attributes id="status-node-2">
          <nvpair id="status-node-2-probe_complete" name="probe_complete" value="true"/>
          <nvpair id="status-node-2-fail-count-frs:0" name="fail-count-frs:0" value="5"/>
          <nvpair id="status-node-2-last-failure-frs:0" name="last-failure-frs:0" value="1277472008"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
    <node_state uname="node-1" ha="active" in_ccm="true" crmd="online" join="member" shutdown="0" expected="member" id="node-1" crm-debug-origin="do_update_resource">
      <transient_attributes id="node-1">
        <instance_attributes id="status-node-1">
          <nvpair id="status-node-1-probe_complete" name="probe_complete" value="true"/>
          <nvpair id="status-node-1-last-failure-frs:1" name="last-failure-frs:1" value="1277469246"/>
          <nvpair id="status-node-1-fail-count-frs:1" name="fail-count-frs:1" value="8"/>
          <nvpair id="status-node-1-master-frs" name="master-frs" value="100"/>
        </instance_attributes>
      </transient_attributes>

This indicates that $CRM_MASTER -D really is deleting the master-frs attribute on node-2; I definitely see it there with the same value of 100 while the resource is active. The frs instance on node-1 remains a Slave the entire time.

I have no constraints or order commands in place that might be blocking it. In the log file, during the failover, I see this:



Jun 25 08:26:02 node-2 lrmd: [2934]: info: rsc:frs:0:91: stop
Jun 25 08:26:02 node-2 crm_attribute: [27185]: info: Invoked: crm_attribute -N node-2 -n master-frs -l reboot -D
Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_trigger_update: Sending flush op to all hosts for: master-frs (<null>)
Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_perform_update: Sent delete 213: node=node-2, attr=master-frs, id=<n/a>, set=(null), section=status
Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_perform_update: Sent delete 215: node=node-2, attr=master-frs, id=<n/a>, set=(null), section=status
Jun 25 08:26:06 node-2 frs[27178]: INFO: Done stopping.
Jun 25 08:26:06 node-2 crmd: [2937]: info: process_lrm_event: LRM operation frs:0_stop_0 (call=91, rc=0, cib-update=111, confirmed=true) ok
Jun 25 08:26:07 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing key=41:117:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_start_0 )
Jun 25 08:26:07 node-2 lrmd: [2934]: info: rsc:frs:0:92: start
Jun 25 08:26:34 node-2 crm_attribute: [30836]: info: Invoked: crm_attribute -N node-2 -n master-frs -l reboot -v 100
Jun 25 08:26:34 node-2 attrd: [2935]: info: attrd_trigger_update: Sending flush op to all hosts for: master-frs (100)
Jun 25 08:26:34 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 218: master-frs=100
Jun 25 08:26:34 node-2 crmd: [2937]: info: process_lrm_event: LRM operation frs:0_start_0 (call=92, rc=0, cib-update=112, confirmed=true) ok
Jun 25 08:26:34 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing key=100:117:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_notify_0 )
Jun 25 08:26:34 node-2 lrmd: [2934]: info: rsc:frs:0:93: notify
Jun 25 08:26:34 node-2 crmd: [2937]: info: process_lrm_event: LRM operation frs:0_notify_0 (call=93, rc=0, cib-update=113, confirmed=true) ok
Jun 25 08:26:36 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing key=106:118:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_notify_0 )
Jun 25 08:26:36 node-2 lrmd: [2934]: info: rsc:frs:0:94: notify
Jun 25 08:26:36 node-2 crmd: [2937]: info: process_lrm_event: LRM operation frs:0_notify_0 (call=94, rc=0, cib-update=114, confirmed=true) ok
Jun 25 08:26:36 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing key=42:118:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_promote_0 )
Jun 25 08:26:36 node-2 lrmd: [2934]: info: rsc:frs:0:95: promote

On the other node, I see this:

Jun 25 09:35:12 node-1 frs[2431]: INFO: Light monitoring of the frs system completed: RUNNING_SLAVE
Jun 25 09:35:14 node-1 frs[2791]: INFO: Light monitoring of the frs system completed: RUNNING_SLAVE
Jun 25 09:35:16 node-1 crmd: [3005]: info: process_graph_event: Action frs:0_monitor_7000 arrived after a completed transition
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: process_graph_event:467 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=frs:0_monitor_7000, magic=0:9;44:118:8:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.468.100) : Inactive graph
Jun 25 09:35:16 node-1 crmd: [3005]: WARN: update_failcount: Updating failcount for frs:0 on node-2 after failed monitor: rc=9 (update=value++, time=1277472916)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 698: Requesting the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=node-2, magic=NA, cib=0.468.101) : Transient attribute: update
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 699: Requesting the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=node-2, magic=NA, cib=0.468.102) : Transient attribute: update
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 700: Requesting the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking the PE: query=700, ref=pe_calc-dc-1277472916-541, seq=3616, quorate=1
Jun 25 09:35:16 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 25 09:35:16 node-1 pengine: [3004]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 25 09:35:16 node-1 pengine: [3004]: info: determine_online_status: Node node-2 is online
Jun 25 09:35:16 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:0_monitor_7000 on node-2: master (failed) (9)
Jun 25 09:35:16 node-1 pengine: [3004]: info: determine_online_status: Node node-1 is online
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed Apache:0 on node-1 to Apache:1
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed pgpool-II:0 on node-1 to pgpool-II:1
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed Postfix:0 on node-1 to Postfix:1
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed MySQL:0 on node-1 to MySQL:1
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed frs:0 on node-1 to frs:1
Jun 25 09:35:16 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:1_monitor_3000 on node-1: unknown error (1)
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: Apache-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: pgpool-II-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: MySQL-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Master/Slave Set: frs-MS
Jun 25 09:35:16 node-1 pengine: [3004]: notice: native_print:      frs:0      (ocf::broadvox:frs):     Master node-2 FAILED
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Slaves: [ node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: Postfix-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: PostgreSQL-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 8 times on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS can fail 999992 more times on node-2 before being forced off
Jun 25 09:35:16 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 8 times on node-1
Jun 25 09:35:16 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS can fail 999992 more times on node-1 before being forced off
Jun 25 09:35:16 node-1 pengine: [3004]: info: master_color: Promoting frs:0 (Master node-2)
Jun 25 09:35:16 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 1 instances of a possible 1 to master
Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (3s) for frs:0 on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (7s) for frs:0 on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (3s) for frs:0 on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (7s) for frs:0 on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource Apache:0  (Started node-2)
Jun 25 09:35:16 node-1 crmd: [3005]: info: process_graph_event: Action frs:0_monitor_3000 arrived after a completed transition
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource Apache:1  (Started node-1)
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: process_graph_event:467 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=frs:0_monitor_3000, magic=0:9;43:118:8:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.468.103) : Inactive graph
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource pgpool-II:0       (Started node-2)
Jun 25 09:35:16 node-1 crmd: [3005]: WARN: update_failcount: Updating failcount for frs:0 on node-2 after failed monitor: rc=9 (update=value++, time=1277472916)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource pgpool-II:1       (Started node-1)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 701: Requesting the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource MySQL:0   (Started node-2)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource MySQL:1   (Started node-1)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Recover resource frs:0    (Master node-2)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking the PE: query=701, ref=pe_calc-dc-1277472916-542, seq=3616, quorate=1
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource frs:1      (Slave node-1)
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=node-2, magic=NA, cib=0.468.104) : Transient attribute: update
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource Postfix:0 (Started node-2)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 702: Requesting the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource Postfix:1 (Started node-1)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource PostgreSQL:0      (Started node-2)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource PostgreSQL:1      (Started node-1)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking the PE: query=702, ref=pe_calc-dc-1277472916-543, seq=3616, quorate=1
Jun 25 09:35:17 node-1 crmd: [3005]: info: handle_response: pe_calc calculation pe_calc-dc-1277472916-541 is obsolete
Jun 25 09:35:17 node-1 pengine: [3004]: info: process_pe_message: Transition 119: PEngine Input stored in: /var/lib/pengine/pe-input-633.bz2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 25 09:35:17 node-1 pengine: [3004]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node node-2 is online
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:0_monitor_3000 on node-2: master (failed) (9)
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:0_monitor_7000 on node-2: master (failed) (9)
Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node node-1 is online
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed Apache:0 on node-1 to Apache:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed pgpool-II:0 on node-1 to pgpool-II:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed Postfix:0 on node-1 to Postfix:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed MySQL:0 on node-1 to MySQL:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed frs:0 on node-1 to frs:1
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:1_monitor_3000 on node-1: unknown error (1)
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: Apache-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: pgpool-II-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: MySQL-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Master/Slave Set: frs-MS
Jun 25 09:35:17 node-1 pengine: [3004]: notice: native_print:      frs:0      (ocf::broadvox:frs):     Master node-2 FAILED
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Slaves: [ node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: Postfix-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: PostgreSQL-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 8 times on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS can fail 999992 more times on node-2 before being forced off
Jun 25 09:35:17 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 8 times on node-1
Jun 25 09:35:17 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS can fail 999992 more times on node-1 before being forced off
Jun 25 09:35:17 node-1 pengine: [3004]: info: master_color: Promoting frs:0 (Master node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 1 instances of a possible 1 to master
Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (3s) for frs:0 on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (7s) for frs:0 on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (3s) for frs:0 on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (7s) for frs:0 on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource Apache:0  (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource Apache:1  (Started node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource pgpool-II:0       (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource pgpool-II:1       (Started node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource MySQL:0   (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource MySQL:1   (Started node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Recover resource frs:0    (Master node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource frs:1      (Slave node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource Postfix:0 (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource Postfix:1 (Started node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource PostgreSQL:0      (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource PostgreSQL:1      (Started node-1)
Jun 25 09:35:17 node-1 crmd: [3005]: info: handle_response: pe_calc calculation pe_calc-dc-1277472916-542 is obsolete
Jun 25 09:35:17 node-1 pengine: [3004]: info: process_pe_message: Transition 120: PEngine Input stored in: /var/lib/pengine/pe-input-634.bz2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 25 09:35:17 node-1 pengine: [3004]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node node-2 is online
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:0_monitor_3000 on node-2: master (failed) (9)
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:0_monitor_7000 on node-2: master (failed) (9)
Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node node-1 is online
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed Apache:0 on node-1 to Apache:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed pgpool-II:0 on node-1 to pgpool-II:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed Postfix:0 on node-1 to Postfix:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed MySQL:0 on node-1 to MySQL:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed frs:0 on node-1 to frs:1
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:1_monitor_3000 on node-1: unknown error (1)
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: Apache-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: pgpool-II-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: MySQL-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Master/Slave Set: frs-MS
Jun 25 09:35:18 node-1 pengine: [3004]: notice: native_print:      frs:0      (ocf::broadvox:frs):     Master node-2 FAILED
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Slaves: [ node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: Postfix-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: PostgreSQL-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 9 times on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS can fail 999991 more times on node-2 before being forced off
Jun 25 09:35:18 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 8 times on node-1
Jun 25 09:35:18 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS can fail 999992 more times on node-1 before being forced off
Jun 25 09:35:18 node-1 pengine: [3004]: info: master_color: Promoting frs:0 (Master node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 1 instances of a possible 1 to master
Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (3s) for frs:0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (7s) for frs:0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (3s) for frs:0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring monitor (7s) for frs:0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource Apache:0  (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource Apache:1  (Started node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource pgpool-II:0       (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource pgpool-II:1       (Started node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource MySQL:0   (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource MySQL:1   (Started node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Recover resource frs:0    (Master node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource frs:1      (Slave node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource Postfix:0 (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource Postfix:1 (Started node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource PostgreSQL:0      (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource PostgreSQL:1      (Started node-1)
Jun 25 09:35:18 node-1 crmd: [3005]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 25 09:35:18 node-1 crmd: [3005]: info: unpack_graph: Unpacked transition 121: 45 actions in 45 synapses
Jun 25 09:35:18 node-1 crmd: [3005]: info: do_te_invoke: Processing graph 121 (ref=pe_calc-dc-1277472916-543) derived from /var/lib/pengine/pe-input-635.bz2
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 70 fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 108: notify frs:0_pre_notify_demote_0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: info: process_pe_message: Transition 121: PEngine Input stored in: /var/lib/pengine/pe-input-635.bz2
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 110: notify frs:1_pre_notify_demote_0 on node-1 (local)
Jun 25 09:35:18 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing key=110:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
Jun 25 09:35:18 node-1 lrmd: [3002]: info: rsc:frs:1:313: notify
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action frs:0_pre_notify_demote_0 (108) confirmed on node-2 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: process_lrm_event: LRM operation frs:1_notify_0 (call=313, rc=0, cib-update=703, confirmed=true) ok
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action frs:1_pre_notify_demote_0 (110) confirmed on node-1 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 71 fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 68 fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 44: demote frs:0_demote_0 on node-2
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action frs:0_demote_0 (44) confirmed on node-2 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 69 fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 72 fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 109: notify frs:0_post_notify_demote_0 on node-2
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 111: notify frs:1_post_notify_demote_0 on node-1 (local)
Jun 25 09:35:18 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing key=111:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
Jun 25 09:35:18 node-1 lrmd: [3002]: info: rsc:frs:1:314: notify
Jun 25 09:35:18 node-1 crmd: [3005]: info: process_lrm_event: LRM operation frs:1_notify_0 (call=314, rc=0, cib-update=704, confirmed=true) ok
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action frs:1_post_notify_demote_0 (111) confirmed on node-1 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action frs:0_post_notify_demote_0 (109) confirmed on node-2 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 73 fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 58 fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 101: notify frs:0_pre_notify_stop_0 on node-2
Jun 25 09:35:19 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 102: notify frs:1_pre_notify_stop_0 on node-1 (local)
Jun 25 09:35:19 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing key=102:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
Jun 25 09:35:19 node-1 lrmd: [3002]: info: rsc:frs:1:315: notify
Jun 25 09:35:19 node-1 crmd: [3005]: info: match_graph_event: Action frs:0_pre_notify_stop_0 (101) confirmed on node-2 (rc=0)
Jun 25 09:35:19 node-1 crmd: [3005]: info: process_lrm_event: LRM operation frs:1_notify_0 (call=315, rc=0, cib-update=705, confirmed=true) ok
Jun 25 09:35:19 node-1 crmd: [3005]: info: match_graph_event: Action frs:1_pre_notify_stop_0 (102) confirmed on node-1 (rc=0)
Jun 25 09:35:19 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 59 fired and confirmed
Jun 25 09:35:19 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 56 fired and confirmed
Jun 25 09:35:19 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 7: stop frs:0_stop_0 on node-2
Jun 25 09:35:19 node-1 crmd: [3005]: info: abort_transition_graph: te_update_diff:157 - Triggered transition abort (complete=0, tag=transient_attributes, id=node-2, magic=NA, cib=0.468.112) : Transient attribute: removal
Jun 25 09:35:19 node-1 crmd: [3005]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Jun 25 09:35:19 node-1 crmd: [3005]: info: update_abort_priority: Abort action done superceeded by restart
Jun 25 09:35:21 node-1 frs[3339]: INFO: Light monitoring of the frs system completed: RUNNING_SLAVE



I did notice this line:

Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed op frs:1_monitor_3000 on node-1: unknown error (1)

That makes me think this might be related to the failed operation from yesterday. However, I have stopped and started the resource several times on node-1 since that failed op occurred. Do I need to clear these entries (clean up the resource) each time I start the resource?
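If cleanup is the answer, I assume the crm shell's cleanup command is what is meant; something like the line below, with frs-MS being my master/slave set:

    # Clear the stored fail-count and operation history for the clone set;
    # append a node name to limit the cleanup to a single node.
    crm resource cleanup frs-MS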





-----Original Message-----
From: Eliot Gable [mailto:egable at broadvox.com]
Sent: Friday, June 25, 2010 1:08 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

Ok; I'm still not having any luck with this....

In my START action, right before I return $OCF_SUCCESS, I do:

$CRM_MASTER -v 100

Where CRM_MASTER is defined the same way the drbd resource agent defines it, with '-l reboot'. In my STOP action, right at the beginning, I do:

$CRM_MASTER -D

I copied the new RA to both nodes, stopped the resource on both nodes, started it, promoted it, then caused the monitoring action to fail. I see Pacemaker generating the usual string of actions, including a STOP, so it should be deleting the master preference. However, no failover occurs. The slave still just sits there as a slave. Is there something else I am missing?
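In case it helps anyone spot a mistake, the relevant pieces of the agent look roughly like the sketch below (the actual start/stop logic is omitted, and I am assuming HA_SBIN_DIR is provided by the ocf-shellfuncs includes, as in the drbd agent):

    # Defined near the top of the RA, as in the drbd agent.
    CRM_MASTER="${HA_SBIN_DIR}/crm_master -l reboot"

    frs_start() {
        # ... bring the service up and verify it started ...
        # Advertise this node as a promotion candidate just before returning.
        $CRM_MASTER -v 100
        return $OCF_SUCCESS
    }

    frs_stop() {
        # Withdraw the promotion preference first, then shut the service down.
        $CRM_MASTER -D
        # ... stop the service ...
        return $OCF_SUCCESS
    }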

Thanks again.




-----Original Message-----
From: Eliot Gable [mailto:egable at broadvox.com]
Sent: Friday, June 25, 2010 12:27 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

After looking at the drbd master/slave RA, I think it is now clear. crm_master, being a wrapper around crm_attribute, already specifies everything I need; all I have to add on the command line are a few options such as the lifetime of the attribute modification, the value to set, or whether to delete the attribute.

So, if I delete the attribute when a STOP is issued and keep the attribute's lifetime set to "reboot", it should be sufficient to cause a failover, correct?

Also, I am thinking that in my START action, after I have performed enough monitoring to confirm that everything came up correctly, I should issue crm_master with the -v option to set a score for the node so it is a good candidate to become master, correct?
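For concreteness, the two calls I have in mind are roughly the ones below. (As far as I can tell, crm_master is meant to be run from inside the RA, since it derives the attribute name, which shows up in my logs as master-frs, from the OCF environment and then calls crm_attribute.)

    # In START, once the post-start checks pass: make this node a promotion candidate.
    crm_master -l reboot -v 100

    # In STOP: withdraw the promotion preference so the peer can be promoted instead.
    crm_master -l reboot -D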




-----Original Message-----
From: Eliot Gable [mailto:egable at broadvox.com]
Sent: Friday, June 25, 2010 12:17 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

Thanks. Should I update my RA to use crm_master when it detects the resource in FAILED_MASTER state, or should I put it in the demote action or something else?

What is the command line needed to "reduce the promotion score"? I looked at the Pacemaker_Explained.pdf document, and while it mentions using crm_master to provide a promotion score, it does not say which attribute actually needs to be modified. Is there another command that can print out all available attributes, or a document somewhere that lists them?




-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Friday, June 25, 2010 8:26 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

On Fri, Jun 25, 2010 at 12:43 AM, Eliot Gable <egable at broadvox.com> wrote:
> Thanks for pointing that out.
>
> I am still having issues with the master/slave resource. When I cause one of the monitoring actions to fail,

As well as failing, it should also use crm_master to reduce the promotion score.

> the master node gets a DEMOTE, STOP, START, PROMOTE and the slave resource just sits there. I want to see DEMOTE on the failed master, then PROMOTE on the slave, then STOP on the failed master, followed by START on the failed master.

The stop will always happen before the promote, regardless of which instance is being promoted.

> How can I achieve this? Is there some sort of constraint or something I can put in place to make it happen?
>
> Thanks again for any insights.
>
>
>
>
>
> -----Original Message-----
> From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
> Sent: Thursday, June 24, 2010 12:37 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Master/Slave not failing over
>
> Hi,
>
> On Thu, Jun 24, 2010 at 12:12:34PM -0400, Eliot Gable wrote:
>> On another note, I cannot seem to get Pacemaker to monitor the master node. It monitors the slave node just fine. These are the operations I have defined:
>>
>>         op monitor interval="5" timeout="30s" \
>>         op monitor interval="10" timeout="30s" OCF_CHECK_LEVEL="10" \
>>         op monitor interval="5" role="Master" timeout="30s" \
>>         op monitor interval="10" role="Master" timeout="30s" OCF_CHECK_LEVEL="10" \
>>         op start interval="0" timeout="40s" \
>>         op stop interval="0" timeout="20s"
>>
>> Did I do something wrong?
>
> Yes, all monitor intervals have to be different. I don't know what
> happened without looking at the logs, but you should set something
> like this:
>
>         op monitor interval="6" role="Master" timeout="30s" \
>         op monitor interval="11" role="Master" timeout="30s" OCF_CHECK_LEVEL="10" \
>
> Thanks,
>
> Dejan
>
>>
>> From: Eliot Gable [mailto:egable at broadvox.com]
>> Sent: Thursday, June 24, 2010 11:55 AM
>> To: The Pacemaker cluster resource manager
>> Subject: [Pacemaker] Master/Slave not failing over
>>
>> I am using the latest CentOS 5.5 packages for pacemaker/corosync. I have a master/slave resource up and running, and when I make the master fail, instead of immediately promoting the slave, it restarts the failed master and re-promotes it back to master. This takes longer than simply promoting the slave right away. I can understand it waiting for a DEMOTE action to succeed on the failed master before it promotes the slave, but that should be all it needs to do. Is there any way I can change this behavior? Am I missing some key point in the process?
>>
>>
>>
>>

_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

CONFIDENTIAL.  This e-mail and any attached files are confidential and should be destroyed and/or returned if you are not the intended and proper recipient.



