[Pacemaker] Node in pending state, resources duplicated and data corruption

Andrew Beekhof andrew at beekhof.net
Mon Mar 17 20:39:11 EDT 2014


On 14 Mar 2014, at 10:32 pm, Gabriel Gomiz <ggomiz at cooperativaobrera.coop> wrote:

> Hi to all!
> 
> We have a 4-node cluster and recently experienced a weird issue with Pacemaker that resulted in three
> database instance resources being duplicated (running simultaneously on 2 nodes) and subsequent data
> corruption.
> 
> I've been investigating the logs and could not reach a conclusion as to why that happened. So
> I'm writing to the list with details of the event to see if someone can help me pinpoint whether there
> was some problem with our operation or whether we hit some bug.
> 
> OS: CentOS 6.4
> Pacemaker version: 1.1.8

6.4 should have 1.1.10 available by now.
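If in doubt, something like this (a minimal sketch, assuming the stock CentOS repositories are configured) will confirm what is installed and pull in the newer packages:

   rpm -q pacemaker
   yum update pacemaker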

> Stack: cman
> Stonith enabled via DELL iDRAC in all 4 nodes
> Nodes: gandalf, isildur, mordor, lorien
> 
> Timeline of events and logs:
> 
> - A resource monitor operation times out and resources on that node (gandalf) are being stopped
> 
> Mar  8 08:41:09 gandalf crmd[31561]:    error: process_lrm_event: LRM operation
> vg_ifx_oltp_monitor_240000 (594) Timed Out (timeout=120000ms)
> 
> - Stopping resources on that node (gandalf) also times out and the node is killed by stonith from
> another node (mordor)

sounds reasonable so far
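A failed stop escalating to fencing is the expected behaviour here: with stonith-enabled=true, the default on-fail for a stop operation is fence. If you want to double-check what that resource is actually configured with, something like this (resource name taken from your logs) will show it:

   crm configure show vg_ifx_oltp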

> 
> Mar  8 08:42:54 gandalf lrmd[31558]:  warning: child_timeout_callback: vg_ifx_oltp_stop_0 process
> (PID 17816) timed out
> ....
> Mar  8 08:42:55 gandalf pengine[31560]:  warning: unpack_rsc_op: Processing failed op stop for
> vg_ifx_oltp on gandalf.san01.cooperativaobrera.coop: unknown error (1)
> Mar  8 08:42:55 gandalf pengine[31560]:  warning: pe_fence_node: Node
> gandalf.san01.cooperativaobrera.coop will be fenced to recover from resource failure(s)
> 
> Mar  8 08:43:09 mordor corosync[25977]:   [TOTEM ] A processor failed, forming new configuration.
> Mar  8 08:43:09 mordor stonith-ng[26212]:   notice: log_operation: Operation 'reboot' [4612] (call 0
> from crmd.31561) for host 'gandalf.san01.cooperativaobrera.coop' with device 'st-gandalf' returned:
> 0 (OK)
> Mar  8 08:43:21 mordor corosync[25977]:   [QUORUM] Members[3]: 2 3 4
> Mar  8 08:43:21 mordor corosync[25977]:   [TOTEM ] A processor joined or left the membership and a
> new membership was formed.
> Mar  8 08:43:21 mordor crmd[26216]:   notice: crm_update_peer_state: cman_event_callback: Node
> gandalf.san01.cooperativaobrera.coop[1] - state is now lost
> Mar  8 08:43:21 mordor crmd[26216]:  warning: check_dead_member: Our DC node
> (gandalf.san01.cooperativaobrera.coop) left the cluster
> Mar  8 08:43:21 mordor kernel: dlm: closing connection to node 1
> Mar  8 08:43:21 mordor corosync[25977]:   [CPG   ] chosen downlist: sender r(0) ip(172.16.1.1) r(1)
> ip(172.16.2.1) ; members(old:4 left:1)
> Mar  8 08:43:21 mordor corosync[25977]:   [MAIN  ] Completed service synchronization, ready to
> provide service.
> Mar  8 08:43:21 mordor fenced[26045]: fencing deferred to isildur.san01.cooperativaobrera.coop
> Mar  8 08:43:22 mordor crmd[26216]:   notice: do_state_transition: State transition S_NOT_DC ->
> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
> Mar  8 08:43:22 mordor crmd[26216]:   notice: do_state_transition: State transition S_ELECTION ->
> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> Mar  8 08:43:22 mordor stonith-ng[26212]:   notice: remote_op_done: Operation reboot of
> gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for
> crmd.31561 at gandalf.san01.cooperativaobrera.coop.10d27664: OK
> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: Peer
> gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by
> mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK
> (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: Notified CMAN that
> 'gandalf.san01.cooperativaobrera.coop' is now fenced
> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: Target may have been our
> leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
> Mar  8 08:43:22 mordor cib[26211]:  warning: cib_process_diff: Diff 0.513.82 -> 0.513.83 from
> lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than
> required
> Mar  8 08:43:22 mordor cib[26211]:  warning: cib_process_diff: Diff 0.513.83 -> 0.513.84 from
> lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than
> required
> Mar  8 08:43:22 mordor cib[26211]:  warning: cib_process_diff: Diff 0.513.84 -> 0.513.85 from
> lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than
> required
> Mar  8 08:43:22 mordor cib[26211]:   notice: cib_process_diff: Diff 0.513.85 -> 0.513.86 from
> lorien.san01.cooperativaobrera.coop not applied to 0.513.85: Failed application of an update diff
> Mar  8 08:43:27 mordor attrd[26214]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Mar  8 08:43:27 mordor attrd[26214]:   notice: attrd_trigger_update: Sending flush op to all hosts
> for: last-failure-srv_mysql_dss (1384966716)
> Mar  8 08:43:27 mordor crmd[26216]:   notice: do_state_transition: State transition S_PENDING ->
> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> Mar  8 08:43:27 mordor attrd[26214]:   notice: attrd_trigger_update: Sending flush op to all hosts
> for: probe_complete (true)
> Mar  8 08:43:27 mordor attrd[26214]:   notice: attrd_trigger_update: Sending flush op to all hosts
> for: last-failure-vm_suc8300 (1384966659)
> Mar  8 08:43:43 mordor stonith-ng[26212]:   notice: remote_op_done: Operation reboot of
> gandalf.san01.cooperativaobrera.coop by lorien.san01.cooperativaobrera.coop for
> crmd.9934 at lorien.san01.cooperativaobrera.coop.3b50e760: OK
> Mar  8 08:43:43 mordor crmd[26216]:   notice: tengine_stonith_notify: Peer
> gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by
> lorien.san01.cooperativaobrera.coop for lorien.san01.cooperativaobrera.coop: OK
> (ref=3b50e760-1142-4d3a-92b8-6936eb9df58c) by client crmd.9934
> Mar  8 08:43:43 mordor crmd[26216]:   notice: tengine_stonith_notify: Notified CMAN that
> 'gandalf.san01.cooperativaobrera.coop' is now fenced
> 
> - Up to this point, gandalf's resources are passed CORRECTLY to other cluster members. We had the
> cluster with 3 nodes online and 1 node offline. Everything was working OK.
> 
> - After gandalf's reboot completed we checked node status, and then started Pacemaker again manually
> and the node reacquired its resources CORRECTLY. So again we had the cluster with all 4 nodes online and
> all resources running on their preferred nodes. Log of the last resource transition back to gandalf:
> 
> Mar  8 08:51:40 lorien pengine[9933]:   notice: LogActions: Move    fs_mysql_hermes#011(Started
> mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
> Mar  8 08:51:40 lorien pengine[9933]:   notice: LogActions: Move    ip_mysql_hermes#011(Started
> mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
> Mar  8 08:51:40 lorien pengine[9933]:   notice: process_pe_message: Calculated Transition 16:
> /var/lib/pacemaker/pengine/pe-input-14.bz2
> Mar  8 08:51:43 lorien crmd[9934]:   notice: run_graph: Transition 16 (Complete=9, Pending=0,
> Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
> Mar  8 08:51:43 lorien crmd[9934]:   notice: do_state_transition: State transition
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> - So, we decided to delete the monitor operation from the resource that caused the timeout (an LVM
> Volume Group) to avoid having that same problem again. We did 'crm configure edit' and deleted two
> op monitor entries, and then just after committing that modification we experienced the problem of
> resource duplication:
> 
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: Diff: --- 0.526.3
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: Diff: +++ 0.527.1 ebf608fdccda78e755a61cdeea18a659
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: --             <op
> id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: --             <op id="vg_ifx_dss-monitor-240"
> interval="240" name="monitor" timeout="120" />
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: ++ <cib epoch="527" num_updates="1"
> admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7"
> update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" cib-last-written="Sat Mar 
> 8 08:58:40 2014" have-quorum="1" dc-uuid="lorien.san01.cooperativaobrera.coop" />
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_state_transition: State transition S_IDLE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_state_transition: State transition S_ELECTION ->
> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_dc_takeover: Marking
> gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as clean

^^^ This is significant. 

The problem is triggered by crmsh performing a replace operation, which causes an election.
When the election completes, the DC checks the stonith_cleanup_list (which evidently contains stale data) and erases gandalf's status section, causing everyone to think it is safely offline.
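As an aside, the same change can be made as a targeted delete instead of a full configuration replace, which avoids this code path entirely. A rough sketch (op ids taken from the diff in your logs):

   cibadmin --delete --xml-text '<op id="vg_ifx_oltp-monitor-240"/>'
   cibadmin --delete --xml-text '<op id="vg_ifx_dss-monitor-240"/>'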

Do you see anything like:

> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: Target may have been our
> leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)

on lorien?

Strange that there was no DC at the time gandalf was fenced.
Do you have a log file configured too? That would have more detail than syslog.
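If not, something along these lines in /etc/sysconfig/pacemaker (a sketch; the exact variables depend on how the packages were built) gives a dedicated, more detailed log:

   PCMK_logfile=/var/log/pacemaker.log
   PCMK_debug=crmd

Then, assuming the default syslog target, checking lorien for the same message would tell us whether it recorded gandalf as the previous DC:

   grep 'Target may have been our leader' /var/log/messages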

I have also made the following commit, which may help:
   https://github.com/beekhof/pacemaker/commit/cef2446



> Mar  8 09:13:52 lorien crmd[9934]:  warning: match_down_event: No match for shutdown action on
> gandalf.san01.cooperativaobrera.coop
> Mar  8 09:13:52 lorien crmd[9934]:   notice: peer_update_callback: Stonith/shutdown of
> gandalf.san01.cooperativaobrera.coop not matched
> Mar  8 09:13:52 lorien crmd[9934]:   notice: crm_update_peer_state: send_stonith_update: Node
> gandalf.san01.cooperativaobrera.coop[1] - state is now lost
> Mar  8 09:13:53 lorien crmd[9934]:  warning: do_state_transition: 3 cluster nodes failed to respond
> to the join offer.
> Mar  8 09:13:53 lorien crmd[9934]:  warning: do_log: FSA: Input I_ELECTION_DC from
> do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Mar  8 09:13:53 lorien crmd[9934]:    error: crmd_ais_dispatch: Recieving messages from a node we
> think is dead: gandalf.san01.cooperativaobrera.coop[1]
> Mar  8 09:13:53 lorien crmd[9934]:  warning: match_down_event: No match for shutdown action on
> gandalf.san01.cooperativaobrera.coop
> Mar  8 09:13:53 lorien crmd[9934]:   notice: peer_update_callback: Stonith/shutdown of
> gandalf.san01.cooperativaobrera.coop not matched
> Mar  8 09:13:53 lorien crmd[9934]:   notice: do_election_count_vote: Election 6 (current: 6, owner:
> lorien.san01.cooperativaobrera.coop): Processed no-vote from gandalf.san01.cooperativaobrera.coop
> (Peer is not part of our cluster)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
> 
> - After that, node gandalf was in "pending" state but it was still running all of its preferred resources.
> The other 3 nodes were online and also running their preferred resources, but then the cluster instructed
> gandalf's resources to be started on other nodes, without ordering a stonith of gandalf, so => duplication:
> 
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_trigger_update: Sending flush op to all hosts
> for: last-failure-srv_mysql_hermes (1394279494)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_trigger_update: Sending flush op to all hosts
> for: last-failure-srv_mysql_oltp (1394279084)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_trigger_update: Sending flush op to all hosts
> for: probe_complete (true)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_trigger_update: Sending flush op to all hosts
> for: last-failure-srv_mysql_it (1384967023)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> st-lorien#011(isildur.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> fs_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> ip_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_jabberd#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_conference#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_orangutan#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> fs_jabber#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> ip_jabber#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> fs_bkoltp_ifx#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> ip_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> fs_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> vg_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_xinetd_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_xinetd_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: process_pe_message: Calculated Transition 53:
> /var/lib/pacemaker/pengine/pe-input-50.bz2
> 
> - These last logs are the DC's orders to start gandalf's resources on other members.
> 
> - Our databases ran as duplicated instances for 7 minutes, but that was enough to cause some pretty
> wild data corruption and we had to restore them completely from backup.
> 
> Can you identify some problem in this operation? Did we do something wrong? Is this a known bug? I'm
> thinking of something related to the first stonith kill (gandalf was DC at that time) and some
> inconsistent state after that. Also, there is an election (?) just at the moment the CIB modification
> was passed to the other members. Could this be related?
> 
> Many thanks in advance for any insight you could help us with.
> 
> Kind regards!
> 
> -- 
> Lic. Gabriel Gomiz - Head of Systems / Administrator
> ggomiz at cooperativaobrera.coop
> Systems Department - Cooperativa Obrera Ltda.
> Tel: (0291) 403-9700
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
