[Pacemaker] large cluster - failure recovery
Radoslaw Garbacz
radoslaw.garbacz at xtremedatainc.com
Wed Nov 4 18:41:53 UTC 2015
Hi,
I have a cluster of 32 nodes and, after some tuning, was able to get it
started and running,
but it does not recover from a node disconnect-reconnect failure.
It regains quorum, but the CIB does not return to a synchronized state and
"cibadmin -Q" times out.
Is there anything I can change in the corosync or pacemaker parameters to
make it recover from such a situation?
(Everything works for smaller clusters.)
In my case it is OK for a node to disconnect (all the major resources are
shut down)
and later reconnect to the cluster (the running monitoring agent will clean up
and restart major resources if needed),
so I do not have STONITH configured.
Details:
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'
Corosync configuration:
token: 10000
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure
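As a quick sanity check on the timing values above, here is a minimal Python sketch. It assumes the one constraint I am aware of from corosync.conf(5), namely that consensus should be at least 1.2 * token. The helper names (parse_totem, consensus_ok) are made up for illustration and are not part of corosync or pacemaker.

```python
# Sketch only: parse the totem timing values quoted above and check the
# lower bound consensus >= 1.2 * token (as I read corosync.conf(5)).
TOTEM_SNIPPET = """\
token: 10000
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure
"""


def parse_totem(text):
    """Return {option: int} for the active (non-commented) 'key: value' lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and trailing comments
        if not line:
            continue
        key, _, value = line.partition(":")
        values[key.strip()] = int(value.strip())
    return values


def consensus_ok(values):
    """True if consensus is at least 1.2 * token (hypothetical helper)."""
    return values["consensus"] >= 1.2 * values["token"]


totem = parse_totem(TOTEM_SNIPPET)
print(totem["token"], totem["consensus"], consensus_ok(totem))
```

With token at 10000 the bound is 12000, so the configured consensus of 15000 passes this particular check; it says nothing about whether the values are adequate for 32 nodes.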
Some logs:
[...]
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
[...]
[...]
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [MAIN ]
Completed service synchronization, ready to provide service.
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
[...]
[...]
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: info:
apply_xml_diff: Digest mis-match: expected
01192e5118739b7c33c23f7645da3f45, calculated
f8028c0c98526179ea5df0a2ba0d09de
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: warning:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.2: Failed application of an update diff
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.3: current "num_updates" is greater than required
[...]
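To make the "not applied" messages above easier to read, here is a minimal Python sketch of how I interpret them. It assumes the CIB version is the triple admin_epoch.epoch.num_updates and that a diff "A -> B" only applies when the local CIB is exactly at version A. This is my reading of the log text, not Pacemaker's actual code; the function name classify and the wording of its return values are made up for illustration.

```python
import re

# Matches e.g. "Diff 1.9254.1 -> 1.9255.1 from local not applied to 1.9275.1"
DIFF_RE = re.compile(
    r"Diff (\d+)\.(\d+)\.(\d+) -> (\d+)\.(\d+)\.(\d+) from \S+ not applied "
    r"to (\d+)\.(\d+)\.(\d+)"
)
FIELDS = ("admin_epoch", "epoch", "num_updates")


def classify(message):
    """Explain why a diff was rejected, based only on the version numbers."""
    # Log lines are wrapped in the mail, so collapse all whitespace first.
    match = DIFF_RE.search(" ".join(message.split()))
    if match is None:
        return None
    nums = tuple(int(g) for g in match.groups())
    required, current = nums[0:3], nums[6:9]
    # Compare field by field, most significant first.
    for name, have, need in zip(FIELDS, current, required):
        if have > need:
            return f"stale diff: current {name} {have} > required {need}"
        if have < need:
            return f"diff is ahead: current {name} {have} < required {need}"
    return "versions match: the diff failed for another reason"


print(classify(
    "cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not applied\n"
    'to 1.9275.1: current "epoch" is greater than required'
))
```

Read this way, the first group of messages shows stonith-ng receiving diffs that are about 20 epochs behind its copy of the CIB, while the 18:21:15 entry with matching versions fails on the digest check instead.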
P.S. Sorry if this should have been posted to the corosync list; only the CIB
synchronization fails, so this list seemed to me the right place.
--
Best Regards,
Radoslaw Garbacz