[Pacemaker] Master-slave master not promoted on Corosync restart
Sékine Coulibaly
scoulibaly at gmail.com
Fri Sep 26 08:37:36 UTC 2014
Hi everyone,
I'm trying my best to diagnose a strange behaviour of my cluster.
My cluster is basically a Master-Slave PostgreSQL cluster, with a VIP.
Two nodes (clustera and clusterb). I'm running RHEL 6.5, Corosync
1.4.1-1 and Pacemaker 1.1.10.
For the simplicity sake of the diagnostic, I took of the slave node.
My problem is that the cluster properly promotes the POSTGRESQL
resource once (I issue a resource cleanup MS_POSTGRESQL to reset
failcount counter, and then all resources are mounted on clustera).
After a Corosync restart, the POSTGRESQL resource is not promoted.
I narrowed down to the point where I add a location constraint
(without this location constraint, after a Corosync restart,
POSTGRESQL resource is promoted):
location VIP_MGT_needs_gw VIP_MGT rule -inf: not_defined pingd or pingd lte 0
The logs show that the pingd attribute value is 1000 (the ping IP is
pingable, and pinged [used tcpdump]). This attribute is set by :
primitive ping_eth1_mgt_gw ocf:pacemaker:ping params
host_list=178.3.1.47 multiplier=1000 op monitor interval=10s meta
migration-threshold=3
>From corosync.log I can see :
Sep 26 09:49:36 [22188] clustera pengine: notice: LogActions:
Start POSTGRESQL:0 (clustera)
Sep 26 09:49:36 [22188] clustera pengine: info: LogActions:
Leave POSTGRESQL:1 (Stopped)
[...]
Sep 26 09:49:36 [22186] clustera lrmd: info: log_execute:
executing - rsc:POSTGRESQL action:start call_id:20
[...]
Sep 26 09:49:37 [22187] clustera attrd: notice:
attrd_trigger_update: Sending flush op to all hosts for:
master-POSTGRESQL (50)
[...]
Sep 26 09:49:37 [22189] clustera crmd: info:
match_graph_event: Action POSTGRESQL_notify_0 (46) confirmed on
clustera (rc=0)
[...]
Sep 26 09:49:38 [22186] clustera lrmd: info: log_finished:
finished - rsc:ping_eth1_mgt_gw action:start call_id:22 pid:22352
exit-code:0 exec-time:2175ms queue-time:0ms
[...]
Sep 26 09:49:38 [22188] clustera pengine: info: clone_print:
Master/Slave Set: MS_POSTGRESQL [POSTGRESQL]
Sep 26 09:49:38 [22188] clustera pengine: info: short_print:
Slaves: [ clustera ]
Sep 26 09:49:38 [22188] clustera pengine: info: short_print:
Stopped: [ clusterb ]
Sep 26 09:49:38 [22188] clustera pengine: info: native_print:
VIP_MGT (ocf::heartbeat:IPaddr2): Stopped
Sep 26 09:49:38 [22188] clustera pengine: info: clone_print:
Clone Set: cloned_ping_eth1_mgt_gw [ping_eth1_mgt_gw]
Sep 26 09:49:38 [22188] clustera pengine: info: short_print:
Started: [ clustera ]
Sep 26 09:49:38 [22188] clustera pengine: info: short_print:
Stopped: [ clusterb ]
Sep 26 09:49:38 [22188] clustera pengine: info:
rsc_merge_weights: VIP_MGT: Rolling back scores from
MS_POSTGRESQL
Sep 26 09:49:38 [22188] clustera pengine: info: native_color:
Resource VIP_MGT cannot run anywhere
Sep 26 09:49:38 [22188] clustera pengine: info: native_color:
POSTGRESQL:1: Rolling back scores from VIP_MGT
Sep 26 09:49:38 [22188] clustera pengine: info: native_color:
Resource POSTGRESQL:1 cannot run anywhere
Sep 26 09:49:38 [22188] clustera pengine: info: master_color:
MS_POSTGRESQL: Promoted 0 instances of a possible 1 to master
Sep 26 09:49:38 [22188] clustera pengine: info: native_color:
Resource ping_eth1_mgt_gw:1 cannot run anywhere
Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp:
Start recurring monitor (60s) for POSTGRESQL:0 on clustera
Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp:
Start recurring monitor (60s) for POSTGRESQL:0 on clustera
Sep 26 09:49:38 [22188] clustera pengine: info: RecurringOp:
Start recurring monitor (10s) for ping_eth1_mgt_gw:0 on clustera
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions:
Leave POSTGRESQL:0 (Slave clustera)
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions:
Leave POSTGRESQL:1 (Stopped)
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions:
Leave VIP_MGT (Stopped)
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions:
Leave ping_eth1_mgt_gw:0 (Started clustera)
Sep 26 09:49:38 [22188] clustera pengine: info: LogActions:
Leave ping_eth1_mgt_gw:1 (Stopped)
[...]
Then everything goes weird. The POSTGRESQL is monitored, seen as
master (rc=8). Since it is expected (???) to be OCF_RUNNING, the
monitor operation failed. That's wierd is n't it, or am I missing
something ? :
Sep 26 09:49:38 [22189] clustera crmd: info:
do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Sep 26 09:49:38 [22189] clustera crmd: info: do_te_invoke:
Processing graph 1 (ref=pe_calc-dc-1411717778-16) derived from
/var/lib/pacemaker/pengine/pe-input-1057.bz2
Sep 26 09:49:38 [22189] clustera crmd: notice: te_rsc_command:
Initiating action 6: monitor POSTGRESQL_monitor_60000 on clustera
(local)
Sep 26 09:49:38 [22189] clustera crmd: info: do_lrm_rsc_op:
Performing key=6:1:0:613bb344-94ce-4210-879c-6d6e90f35871
op=POSTGRESQL_monitor_60000
Sep 26 09:49:38 [22189] clustera crmd: notice: te_rsc_command:
Initiating action 33: monitor ping_eth1_mgt_gw_monitor_10000 on
clustera (local)
Sep 26 09:49:38 [22189] clustera crmd: info: do_lrm_rsc_op:
Performing key=33:1:0:613bb344-94ce-4210-879c-6d6e90f35871
op=ping_eth1_mgt_gw_monitor_10000
Sep 26 09:49:38 [22184] clustera cib: info: crm_client_new:
Connecting 0x1bbcc00 for uid=0 gid=0 pid=22767
id=0ef393a5-6f7b-4893-9b06-0fab26d9a06d
Sep 26 09:49:38 [22184] clustera cib: info:
cib_process_request: Completed cib_query operation for section
nodes: OK (rc=0, origin=local/crm_attribute/2, version=0.692.9)
Sep 26 09:49:38 [22187] clustera attrd: notice:
attrd_trigger_update: Sending flush op to all hosts for:
master-POSTGRESQL (100)
Sep 26 09:49:38 [22184] clustera cib: info:
cib_process_request: Completed cib_query operation for section
//cib/status//node_state[@id='clustera']//transient_attributes//nvpair[@name='master-POSTGRESQL']:
OK (rc=0, origin=local/attrd/8, version=0.692.9)
Sep 26 09:49:38 [22184] clustera cib: info:
crm_client_destroy: Destroying 0 events
Sep 26 09:49:38 [22187] clustera attrd: notice:
attrd_perform_update: Sent update 9: master-POSTGRESQL=100
Sep 26 09:49:38 [22184] clustera cib: info:
cib_process_request: Completed cib_modify operation for section
status: OK (rc=0, origin=local/attrd/9, version=0.692.10)
Sep 26 09:49:38 [22189] clustera crmd: info:
abort_transition_graph: te_update_diff:172 - Triggered transition
abort (complete=0, node=clustera, tag=nvpair,
id=status-clustera-master-POSTGRESQL, name=master-POSTGRESQL,
value=100, magic=NA, cib=0.692.10) : Transient attribute: update
Sep 26 09:49:38 [22189] clustera crmd: notice:
process_lrm_event: LRM operation POSTGRESQL_monitor_60000
(call=29, rc=8, cib-update=33, confirmed=false) master
Sep 26 09:49:38 [22184] clustera cib: info:
cib_process_request: Completed cib_modify operation for section
status: OK (rc=0, origin=local/crmd/33, version=0.692.11)
Sep 26 09:49:38 [22189] clustera crmd: warning: status_from_rc:
Action 6 (POSTGRESQL_monitor_60000) on clustera failed (target: 0 vs.
rc: 8): Error
Sep 26 09:49:38 [22189] clustera crmd: warning:
update_failcount: Updating failcount for POSTGRESQL on clustera
after failed monitor: rc=8 (update=value++, time=1411717778)
Sep 26 09:49:38 [22189] clustera crmd: info:
abort_transition_graph: match_graph_event:313 - Triggered transition
abort (complete=0, node=clustera, tag=lrm_rsc_op,
id=POSTGRESQL_last_failure_0,
magic=0:8;6:1:0:613bb344-94ce-4210-879c-6d6e90f35871, cib=0.692.11) :
Event failed
Sep 26 09:49:38 [22189] clustera crmd: info:
match_graph_event: Action POSTGRESQL_monitor_60000 (6) confirmed
on clustera (rc=4)
Sep 26 09:49:38 [22189] clustera crmd: warning:
update_failcount: Updating failcount for POSTGRESQL on clustera
after failed monitor: rc=8 (update=value++, time=1411717778)
Sep 26 09:49:38 [22189] clustera crmd: info:
process_graph_event: Detected action (1.6)
POSTGRESQL_monitor_60000.29=master: failed
Sep 26 09:49:38 [22187] clustera attrd: notice:
attrd_trigger_update: Sending flush op to all hosts for:
fail-count-POSTGRESQL (1)
Sep 26 09:49:38 [22184] clustera cib: info:
cib_process_request: Completed cib_query operation for section
//cib/status//node_state[@id='clustera']//transient_attributes//nvpair[@name='fail-count-POSTGRESQL']:
No such device or address (rc=-6, origin=local/attrd/10,
version=0.692.11)
Sep 26 09:49:38 [22184] clustera cib: info:
cib_process_request: Completed cib_query operation for section
/cib: OK (rc=0, origin=local/attrd/11, version=0.692.11)
Sep 26 09:49:38 [22187] clustera attrd: notice:
attrd_perform_update: Sent update 12: fail-count-POSTGRESQL=1
Attached are the bz2 files and the corosync.log.
Any help is appreciated !
Sekine
More information about the Pacemaker
mailing list