[Pacemaker] Node in pending state, resources duplicated and data corruption

Tue Mar 18 07:04:19 EDT 2014

Andrew:

Thanks for your answer!

-- 
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggomiz at cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700

On 03/17/2014 09:39 PM, Andrew Beekhof wrote:
> On 14 Mar 2014, at 10:32 pm, Gabriel Gomiz <ggomiz at cooperativaobrera.coop> wrote:
>
>> Hi to all!
>>
>> We've a 4 node cluster and recently experienced a weird issue with Pacemaker that resulted in three
>> database instance resources duplicated (running simultaneously in 2 nodes) and subsequent data
>> corruption.
>>
>> I've been investigating logs and could not arrive to a conclusion as to why did that happened. So
>> I'm writing to the list with details of the event to see if someone can help me pinpoint if there
>> was some problem with our operation of maybe we hit some bug.
>>
>> OS: CentOS 6.4
>> Pacemaker version: 1.1.8
> 6.4 should have 1.1.10 available by now.

Yes, we are planning an upgrade to CentOS 6.5 and 1.1.10. But before that we wanted to investigate
this issue.

> - So, we decided to delete the operation monitor from the resource that caused the timeout (a LVM
> Volume Group) to avoid having that same problem again. We did 'crm configure edit' and deleted two
> op monitor, and then just after commiting that modification to the crm we experienced the problem of
> resource duplication:
>
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: Diff: --- 0.526.3
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: Diff: +++ 0.527.1 ebf608fdccda78e755a61cdeea18a659
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: --             <op
> id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: --             <op id="vg_ifx_dss-monitor-240"
> interval="240" name="monitor" timeout="120" />
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: ++ <cib epoch="527" num_updates="1"
> admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7"
> update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" cib-last-written="Sat Mar 
> 8 08:58:40 2014" have-quorum="1" dc-uuid="lorien.san01.cooperativaobrera.coop" />
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_state_transition: State transition S_IDLE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_state_transition: State transition S_ELECTION ->
> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_dc_takeover: Marking
> gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as clean
> ^^^ This is significant. 
>
> The problem is triggered by crmsh performing a replace operation - which causes an election.
> When the election is completed, the DC checks the stonith_cleanup_list (which evidently contains old data) and erases gandalf's status section (causing everyone to think its safely offline).
>
> Do you see anything like:
>
>> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: Target may have been our
>> leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
> on lorien?

Yes, 8:43 we have:

Mar  8 08:43:21 lorien corosync[9135]:   [QUORUM] Members[3]: 2 3 4
Mar  8 08:43:21 lorien corosync[9135]:   [TOTEM ] A processor joined or left the membership and a
new membership was formed.
Mar  8 08:43:21 lorien crmd[9934]:   notice: crm_update_peer_state: cman_event_callback: Node
gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar  8 08:43:21 lorien crmd[9934]:  warning: check_dead_member: Our DC node
(gandalf.san01.cooperativaobrera.coop) left the cluster
Mar  8 08:43:21 lorien kernel: dlm: closing connection to node 1
Mar  8 08:43:21 lorien corosync[9135]:   [CPG   ] chosen downlist: sender r(0) ip(172.16.1.1) r(1)
ip(172.16.2.1) ; members(old:4 left:1)
Mar  8 08:43:21 lorien corosync[9135]:   [MAIN  ] Completed service synchronization, ready to
provide service.
Mar  8 08:43:21 lorien fenced[9186]: fencing deferred to isildur.san01.cooperativaobrera.coop
Mar  8 08:43:22 lorien crmd[9934]:   notice: do_state_transition: State transition S_NOT_DC ->
S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar  8 08:43:22 lorien crmd[9934]:   notice: do_state_transition: State transition S_ELECTION ->
S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar  8 08:43:22 lorien stonith-ng[9930]:   notice: remote_op_done: Operation reboot of
gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for
crmd.31561 at gandalf.san01.cooperativaobrera.coop.10d27664: OK
Mar  8 08:43:22 lorien cib[9929]:  warning: cib_process_diff: Diff 0.513.82 -> 0.513.83 from
mordor.san01.cooperativaobrera.coop not applied to 0.513.84: current "num_updates" is greater than
required
Mar  8 08:43:22 lorien cib[9929]:  warning: cib_process_diff: Diff 0.513.83 -> 0.513.84 from
mordor.san01.cooperativaobrera.coop not applied to 0.513.84: current "num_updates" is greater than
required
Mar  8 08:43:22 lorien cib[9929]:   notice: cib_process_diff: Diff 0.513.84 -> 0.513.85 from
mordor.san01.cooperativaobrera.coop not applied to 0.513.84: Failed application of an update diff
Mar  8 08:43:22 lorien cib[9929]:  warning: cib_server_process_diff: Not requesting full refresh in
R/W mode
Mar  8 08:43:22 lorien cib[9929]:  warning: cib_diff_notify: Update (client: crmd, call:256):
0.513.84 -> 0.513.85 (Application of an update diff failed)
Mar  8 08:43:22 lorien crmd[9934]:   notice: tengine_stonith_notify: Peer
gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by
mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK
(ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar  8 08:43:22 lorien crmd[9934]:   notice: tengine_stonith_notify: Notified CMAN that
'gandalf.san01.cooperativaobrera.coop' is now fenced
Mar  8 08:43:22 lorien crmd[9934]:   notice: tengine_stonith_notify: Target may have been our leader
gandalf.san01.cooperativaobrera.coop (recorded: <unset>)

> Strange that there was no DC at the time gandalf was fenced.  
> Do you have a log file configured too? Those have more detail than syslog.
Yes we have corosync log file with this, for example:

Mar 08 08:43:21 corosync [QUORUM] Members[3]: 2 3 4
Mar 08 08:43:21 corosync [TOTEM ] A processor joined or left the membership and a new membership was
formed.
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar       crmd:     info:
cman_event_callback:       Membership 5732: quorum retained
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar       crmd:   notice:
crm_update_peer_state:     cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state
is now lost
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar       crmd:     info:
peer_update_callback:      gandalf.san01.cooperativaobrera.coop is now lost (was member)
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar       crmd:  warning:
check_dead_member:         Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar       crmd:     info: send_ais_text:     Peer
overloaded or membership in flux: Re-sending message (Attempt 1 of 20)

Maybe, this is significant : 'Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster' ... ?

Please tell me if you need more details:

> I have also made the following commit which may also help:
>    https://github.com/beekhof/pacemaker/commit/cef2446
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 555 bytes
Desc: OpenPGP digital signature
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20140318/b34bd18a/attachment-0003.sig>