[Pacemaker] large cluster - failure recovery

Radoslaw Garbacz radoslaw.garbacz at xtremedatainc.com
Wed Nov 4 18:41:53 UTC 2015


Hi,

I have a cluster of 32 nodes and, after some tuning, was able to get it
started and running,
but it does not recover from a node disconnect-reconnect failure.
It regains quorum, but the CIB does not return to a synchronized state and
"cibadmin -Q" times out.

Are there any corosync or pacemaker parameters I can tune to make it
recover from such a situation?
(Everything works for smaller clusters.)

In my case it is OK for a node to disconnect (all the major resources are
shut down)
and later reconnect to the cluster (the running monitoring agent will clean up
and restart major resources if needed),
so I do not have STONITH configured.

Details:
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'


Corosync configuration:
        token: 10000
        #token_retransmits_before_loss_const: 10
        consensus: 15000
        join: 1000
        send_join: 80
        merge: 1000
        downcheck: 2000
        #rrp_problem_count_timeout: 5000
        max_network_delay: 150 # for azure
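
As a quick sanity check on the timing values above, here is a minimal Python sketch. It is not part of corosync; the names TOTEM_FRAGMENT, parse_totem and consensus_ok are made up for illustration. It only checks the rule from corosync.conf(5) that consensus must be at least 1.2 * token, and says nothing about whether these values are suitable for 32 nodes.

```python
import re

# Timing values copied from the configuration above (milliseconds).
TOTEM_FRAGMENT = """
token: 10000
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure
"""


def parse_totem(text):
    """Return {name: int} for the uncommented 'name: value' lines."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.fullmatch(r"(\w+):\s*(\d+)", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def consensus_ok(values):
    """True if consensus >= 1.2 * token (integer arithmetic, no rounding)."""
    return 10 * values["consensus"] >= 12 * values["token"]


totem = parse_totem(TOTEM_FRAGMENT)
print(totem["token"], totem["consensus"], consensus_ok(totem))
# prints: 10000 15000 True
```

With the values above the rule is satisfied (15000 >= 12000), so the consensus setting itself does not look like the problem.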


Some logs:

[...]
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
cib_process_diff:         Diff 1.9254.1 -> 1.9255.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
cib_process_diff:         Diff 1.9255.1 -> 1.9256.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
cib_process_diff:         Diff 1.9256.1 -> 1.9257.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
cib_process_diff:         Diff 1.9257.1 -> 1.9258.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
[...]

[...]
Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:
cib_native_perform_op_delegate:         Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:
get_cib_copy:   Couldnt retrieve the CIB
Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:
cib_native_perform_op_delegate:         Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:
get_cib_copy:   Couldnt retrieve the CIB
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [MAIN  ]
Completed service synchronization, ready to provide service.
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
[...]

[...]
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
update_cib_cache_cb:    [cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:     info:
apply_xml_diff:         Digest mis-match: expected
01192e5118739b7c33c23f7645da3f45, calculated
f8028c0c98526179ea5df0a2ba0d09de
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:  warning:
cib_process_diff:       Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.2: Failed application of an update diff
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
update_cib_cache_cb:    [cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
cib_process_diff:       Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.3: current "num_updates" is greater than required
[...]


P.S. Sorry if this should have been posted to the corosync list; only the CIB
synchronization fails, so this group seemed to me the right place.

-- 
Best Regards,

Radoslaw Garbacz