[Pacemaker] [RFC PATCH] Try to fix startup-fencing not happening

Andrew Beekhof andrew at beekhof.net
Wed Jul 11 05:20:07 CEST 2012


I know this thread is over a year old and you may not care anymore,
but I was never satisfied with the answers I gave, so it stayed on my
todo list.

I've done some work recently improving the function for detecting when
to fence nodes and gracefully handling any non-pacemaker nodes that
are part of the cluster, but I think the main problem is that I have
almost always done my testing with multicast.
With multicast, the cluster never tells me about nodes it hasn't seen
yet, whereas unicast-based clusters report those nodes as down.
Figuring out how to distinguish between 'down' and 'down but we've
never seen them before' is the tricky part.
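
To make the distinction concrete, here is a minimal sketch of the idea,
not the actual peer cache code. The PeerTracker class and the state names
are hypothetical and exist only for illustration; the point is simply that
a "has ever been seen" flag has to be remembered separately from the
current up/down report.

```python
from enum import Enum


class PeerState(Enum):
    MEMBER = "member"          # currently part of the membership
    LOST = "lost"              # was a member earlier, then went away
    NEVER_SEEN = "never-seen"  # reported as down, never observed up


class PeerTracker:
    """Hypothetical tracker; not Pacemaker's real peer cache."""

    def __init__(self):
        self._ever_seen = set()
        self._up = set()

    def update(self, node, is_up):
        # Record one membership report for one node.
        if is_up:
            self._ever_seen.add(node)
            self._up.add(node)
        else:
            self._up.discard(node)

    def state(self, node):
        if node in self._up:
            return PeerState.MEMBER
        if node in self._ever_seen:
            return PeerState.LOST
        return PeerState.NEVER_SEEN


tracker = PeerTracker()
# A unicast-style membership layer reports every configured node,
# including ones that have never joined.
tracker.update("node1", True)
tracker.update("node2", True)
tracker.update("node3", False)  # configured, never joined
tracker.update("node2", False)  # joined, then left
```

Here node2 and node3 are both reported as down, but only node2 has a
history, so the two cases can be handled differently.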

So I've added a new todo item to do some testing with unicast clusters,
and I should have it sorted out in the next few weeks.
Thanks for bringing the problem to my attention.

-- Andrew

On Fri, Mar 18, 2011 at 9:54 AM, Simone Gotti <simone.gotti at gmail.com> wrote:
> Hi,
>
> When using corosync + pcmk v1 and starting both corosync and pacemakerd
> (and, I think, also when using heartbeat or anything other than cman as
> quorum provider), at startup the CIB will not contain a <node_state/>
> entry for the nodes that are not in the cluster.
>
> When using cman as quorum provider, by contrast, there will be a
> <node_state/> for every node known by cman, because
> lib/common/ais.c:cman_event_callback calls crm_update_peer for every
> node reported by cman_get_nodes.
>
> Something similar will happen when using corosync+pcmkv1 if corosync is
> started on N nodes but pacemakerd is started only on N-M nodes.
>
> All of this will break 'startup-fencing' because, from my understanding,
> the logic is this:
>
> 1) At startup all the nodes are marked (in
> lib/pengine/unpack.c:unpack_node) as unclean.
> 2) lib/pengine/unpack.c:unpack_status iterates only over the
> <node_state/> entries present in the cib status section, resetting each
> node to clean first and then marking it unclean if some conditions are met.
> 3) In pengine/allocate.c:stage6 all the unclean nodes are fenced.
>
> In the above conditions you'll have a <node_state/> in the cib status
> section also for nodes without pacemakerd enabled and the startup
> fencing won't happen because there isn't any condition in unpack_status
> that will mark them as unclean.
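
The three steps and the failure case quoted above can be modelled roughly
as follows. This is a simplified sketch, not the pengine code: the
function name, the dictionary keys (expected_up, is_up) and the single
"expected up but not up" condition are assumptions standing in for the
real checks in unpack_status.

```python
def nodes_to_fence(configured, node_states, startup_fencing=True):
    # Step 1 (unpack_node): every configured node starts out unclean
    # when startup-fencing is enabled.
    unclean = {name: startup_fencing for name in configured}

    # Step 2 (unpack_status): only nodes that have a <node_state/> entry
    # are visited. Each one is reset to clean, then marked unclean again
    # if a condition holds (here: expected up but not actually up).
    for name, state in node_states.items():
        unclean[name] = False
        if state.get("expected_up") and not state.get("is_up"):
            unclean[name] = True

    # Step 3 (stage6): whatever is still unclean gets fenced.
    return sorted(name for name, dirty in unclean.items() if dirty)


configured = ["node1", "node2", "node3"]
healthy = {"expected_up": True, "is_up": True}

# node3 has no <node_state/> entry, so it stays unclean and is fenced.
without_entry = nodes_to_fence(
    configured, {"node1": healthy, "node2": healthy})

# node3 has an entry (as with cman) but no expectation recorded, so it
# is reset to clean and nothing marks it unclean again.
with_entry = nodes_to_fence(
    configured,
    {"node1": healthy, "node2": healthy, "node3": {"is_up": False}})
```

In the second call node3 is never fenced even though it has never been
up, which is the behaviour being reported.
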
>
>
> I'm no expert on the code. I discarded the option of registering only
> the active nodes at startup, rather than all the nodes known by cman,
> as it won't fix the corosync+pcmkv1 case.
>
> Instead I tried to understand when a node that has its status in the cib
> should be startup fenced, and a possible solution is in the attached patch.
> I noticed that when crm_update_peer inserts a new node, the new entry
> doesn't have the "expected" attribute set. So if startup-fencing is
> enabled I'm going to set the node as expected up.
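
As I read it, the idea in the quoted paragraph amounts to something like
the sketch below. It is not the attached patch: the function names, the
is_up key and the literal "member" value for the expected attribute are
assumptions made for illustration.

```python
def expected_up(node_state, startup_fencing):
    expected = node_state.get("expected")
    if expected is None:
        # No expectation recorded yet (a freshly inserted entry): with
        # startup-fencing enabled, assume the node should be up.
        return startup_fencing
    return expected == "member"  # assumed value meaning "should be up"


def needs_fencing(node_state, startup_fencing=True):
    # A node that should be up but is not is treated as unclean.
    up = node_state.get("is_up", False)
    return expected_up(node_state, startup_fencing) and not up


fresh_entry = {"is_up": False}  # entry with no "expected" attribute
fence_fresh = needs_fencing(fresh_entry, startup_fencing=True)
skip_fresh = needs_fencing(fresh_entry, startup_fencing=False)
```

With startup-fencing enabled the fresh entry is treated as unclean; with
it disabled, nothing changes for that node.
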
>
>
> Thanks!
> Bye!
>
> --
> Simone Gotti
>
>
