[Pacemaker] Node in pending state, resources duplicated and data corruption

Andrew Beekhof andrew at beekhof.net
Tue Mar 18 20:04:04 EDT 2014


On 19 Mar 2014, at 10:15 am, Andrew Beekhof <andrew at beekhof.net> wrote:

> 
> On 18 Mar 2014, at 10:04 pm, Gabriel Gomiz <ggomiz at cooperativaobrera.coop> wrote:
> 
>> Maybe, this is significant : 'Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster' ... ?
> 
> Very. I hadn't noticed it was the DC at the time it died.
> 
>> 
>> Please tell me if you need more details:
> 
> Can I get the file logs from lorien from Mar 08 08:43:00 to 09:14:00 please?
> 

Riiiight, so this is the story:

Mar 08 08:43:22 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Peer gandalf was terminated (st_notify_fence) by mordor for gandalf: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Notified CMAN that 'gandalf' is now fenced
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Target may have been our leader gandalf (recorded: <unset>)
Mar 08 09:13:52 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
Mar 08 09:13:52 [9934] lorien       crmd:   notice: do_dc_takeover: 	Marking gandalf, target of a previous stonith action, as clean

In tengine_stonith_notify() we can add nodes to stonith_cleanup_list; later, do_dc_takeover() checks that list and marks every node in it as clean.

As you can see above, the stonith notification arrives just after the first call to do_dc_takeover() at 08:43:22, so anything it queues is not processed until the next takeover.
In the version you have, there is some dodgy code in tengine_stonith_notify() which incorrectly adds gandalf to stonith_cleanup_list, causing Pacemaker to (incorrectly) erase gandalf's status section at 09:13:52, when another election occurs.
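
To make the ordering problem concrete, here is a toy replay of the sequence in the log above. It is only a sketch, not Pacemaker code: ToyCrmd, replay() and the queue_when_dc flag are hypothetical names, the status dict stands in for the CIB status section, and the assumption that gandalf had rejoined before 09:13:52 is mine. The "fixed" branch just models "do not queue the target when we are already DC"; see the commit below for what the real fix does.

```python
class ToyCrmd:
    """Toy stand-in for one crmd instance (hypothetical, not Pacemaker's code)."""

    def __init__(self, queue_when_dc):
        self.is_dc = False
        self.cleanup_list = []   # stands in for stonith_cleanup_list
        self.status = {}         # node name -> toy "status section"
        # True models the faulty behaviour described in the thread.
        self.queue_when_dc = queue_when_dc

    def dc_takeover(self):
        """Become DC and mark every queued fencing target as clean."""
        self.is_dc = True
        erased = list(self.cleanup_list)
        for node in erased:
            self.status.pop(node, None)   # erase the node's status
        self.cleanup_list.clear()
        return erased

    def stonith_notify(self, target):
        """Handle a 'target was fenced' notification."""
        if self.is_dc:
            # As DC we can clean up immediately.
            self.status.pop(target, None)
            if self.queue_when_dc:
                # Faulty path: the target is queued even though it
                # has already been dealt with.
                self.cleanup_list.append(target)
        else:
            # Not DC: remember the target for a later takeover.
            self.cleanup_list.append(target)


def replay(queue_when_dc):
    crmd = ToyCrmd(queue_when_dc)
    crmd.status["gandalf"] = "before fencing"
    first = crmd.dc_takeover()             # 08:43:22 takeover, list is empty
    crmd.stonith_notify("gandalf")         # 08:43:22 notification, just after
    crmd.status["gandalf"] = "rejoined"    # assumption: gandalf came back
    second = crmd.dc_takeover()            # 09:13:52 second election
    return first, second, crmd.status


if __name__ == "__main__":
    print("faulty:", replay(queue_when_dc=True))
    print("fixed: ", replay(queue_when_dc=False))
```

With queue_when_dc=True the second takeover erases the status of a node that is alive again, which is the toy equivalent of the cluster forgetting what gandalf is running. With queue_when_dc=False nothing is left in the list and the status survives.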

This was fixed during the RC phase of Pacemaker 1.1.10:

  https://github.com/beekhof/pacemaker/commit/f30e1e43

I don't believe I quite understood the severity of that fix at the time (otherwise I'd have made more noise about it).

Since you're on CentOS 6.4, there should already be updated packages that include this fix.