[Pacemaker] Node in pending state, resources duplicated and data corruption

Gabriel Gomiz ggomiz at cooperativaobrera.coop
Wed Mar 19 06:56:38 EDT 2014


On 03/18/2014 09:04 PM, Andrew Beekhof wrote:
> Riiiight, so this is the story:
>
> Mar 08 08:43:22 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
> Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Peer gandalf was terminated (st_notify_fence) by mordor for gandalf: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
> Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Notified CMAN that 'gandalf' is now fenced
> Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify: 	Target may have been our leader gandalf (recorded: <unset>)
> Mar 08 09:13:52 [9934] lorien       crmd:     info: do_dc_takeover: 	Taking over DC status for this partition
> Mar 08 09:13:52 [9934] lorien       crmd:   notice: do_dc_takeover: 	Marking gandalf, target of a previous stonith action, as clean
>
> In tengine_stonith_notify() we potentially add things to stonith_cleanup_list and then in do_dc_takeover() we check the stonith_cleanup_list and mark any nodes in it as clean.
>
> As you can see above, the stonith notification comes just after the call to do_dc_takeover().
> In the version you have, there is some dodgy code in tengine_stonith_notify() that incorrectly adds gandalf to stonith_cleanup_list, causing Pacemaker to (incorrectly) erase its status section at 09:13:52, when another election occurs.
>
> This was fixed during the RC-phase of Pacemaker-1.1.10:
>
>   https://github.com/beekhof/pacemaker/commit/f30e1e43
>
> I don't believe I quite understood the severity of that fix at the time (otherwise I'd have made more noise about it).
>
> Since you're on CentOS 6.4, there should already be updated packages that include this fix.

Andrew: thanks again for taking the time to check this case. We will be updating to 1.1.10 as soon as possible. Hugs!
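
For anyone finding this thread later, here is a rough toy model of the sequence Andrew describes. It is not Pacemaker source code: the ToyCrmd class, the "buggy" flag and the replay() helper are made up for illustration, and only the names do_dc_takeover, tengine_stonith_notify and stonith_cleanup_list come from his explanation. It also assumes that gandalf had rejoined and repopulated its status before the second election, and that the fixed behaviour is simply "do not queue the node when we are already DC", which may not match what the actual commit does.

```python
# Toy model only: NOT Pacemaker code. ToyCrmd, `buggy` and replay() are
# hypothetical; the real logic lives in crmd's C sources.

class ToyCrmd:
    def __init__(self, buggy):
        self.buggy = buggy                # True models the pre-fix behaviour
        self.is_dc = False
        self.stonith_cleanup_list = []    # nodes to mark clean on takeover
        self.status_section = {}          # node name -> recorded status
        self.log = []

    def do_dc_takeover(self):
        """Become DC and mark every queued node as clean."""
        self.is_dc = True
        for node in self.stonith_cleanup_list:
            self.log.append(f"Marking {node} as clean")
            self.status_section.pop(node, None)   # status gets erased
        self.stonith_cleanup_list.clear()

    def tengine_stonith_notify(self, target):
        """Handle a 'peer was fenced' notification."""
        # Assumption: a correct implementation only queues the node when
        # we are not (yet) the DC; the buggy one queues it regardless.
        if self.buggy or not self.is_dc:
            self.stonith_cleanup_list.append(target)


def replay(buggy):
    """Replay the order of events seen in the quoted logs."""
    crmd = ToyCrmd(buggy)
    crmd.do_dc_takeover()                     # 08:43:22 takeover
    crmd.tengine_stonith_notify("gandalf")    # 08:43:22 notification, just after
    # Assumption: gandalf rebooted, rejoined and reported its status again.
    crmd.status_section["gandalf"] = "online, running resources"
    crmd.do_dc_takeover()                     # 09:13:52 another election
    return crmd


if __name__ == "__main__":
    for buggy in (True, False):
        crmd = replay(buggy)
        print("buggy" if buggy else "fixed", crmd.status_section, crmd.log)
```

In the buggy variant the stale entry survives in the list until the second takeover and wipes the status of a node that is alive again, which would fit what we saw (the cluster no longer knowing what gandalf was running). In the fixed variant the list stays empty and nothing is erased.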

