[Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)
Giuseppe Ragusa
giuseppe.ragusa at hotmail.com
Tue Jul 15 23:43:55 CEST 2014
On Mon, Jul 14, 2014, at 02:07, Andrew Beekhof wrote:
>
> On 10 Jul 2014, at 11:17 am, Giuseppe Ragusa <giuseppe.ragusa at hotmail.com> wrote:
>
> > On Thu, Jul 10, 2014, at 00:06, Andrew Beekhof wrote:
> >>
> >> On 9 Jul 2014, at 10:28 pm, Giuseppe Ragusa <giuseppe.ragusa at hotmail.com> wrote:
> >>
> >>> On Tue, Jul 8, 2014, at 02:59, Andrew Beekhof wrote:
> >>>>
> >>>> On 4 Jul 2014, at 3:16 pm, Giuseppe Ragusa <giuseppe.ragusa at hotmail.com> wrote:
> >>>>
> >>>>> Hi all,
> >>>>> I'm trying to create a script as per subject (on CentOS 6.5, CMAN+Pacemaker, only DRBD+KVM active/passive resources; SNMP-UPS monitored by NUT).
> >>>>>
> >>>>> Ideally I think that each node should stop (disable) all locally-running VirtualDomain resources (doing so cleanly demotes and then downs the DRBD resources underneath), then put itself in standby and finally shut down.
> >>>>
> >>>> Since the end goal is shutdown, why not just run 'pcs cluster stop' ?
> >>>
> >>> I thought that this action would cause a communication interruption (since Corosync would not be responding to the peer) and so cause the other node to stonith us;
> >>
> >> No. Shutdown is a globally co-ordinated process.
> >> We don't fence nodes we know shut down cleanly.
> >
> > Thanks for the clarification.
> > Now that you said it, it seems also logical and even obvious ;>
> >
> >>> I know that ideally the other node too should perform "pcs cluster stop" in short, since the same UPS powers both, but I worry about timing issues (and "races") in UPS monitoring since it is a large Enterprise UPS monitored by SNMP.
> >>>
> >>> Furthermore I do not know what happens to running resources at "pcs cluster stop": I infer from your suggestion that resources are brought down and not migrated to the other node, correct?
> >>
> >> If the other node is shutting down too, they'll simply be stopped.
> >> Otherwise we'll try to move them.
> >
> > It's the "moving" that worries me :)
>
> They'll be stopped again as soon as the second node says "shutdown" too.
Ok, understood.
I think that "pcs cluster stop --all" is the best option (it avoids moving and then stopping resources again); see below for more reasoning about this...
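To make the conclusion concrete, here is a minimal sketch of the UPS handler built around the single coordinated stop discussed above (NOT the original script; the PCS/SHUTDOWN variables are only there to allow dry-run testing, and the NUT integration point is an assumption):

```shell
#!/bin/bash
# Sketch: on OnBattery+LowBattery, stop the whole cluster in one
# coordinated pass, then halt the OS. The pcs subcommands are the
# ones named in this thread.
: "${PCS:=pcs}"
: "${SHUTDOWN:=/sbin/shutdown}"

ups_critical_handler() {
    # "--all" stops pacemaker+cman on every node together, so resources
    # are simply stopped in order rather than migrated back and forth.
    if ! ${PCS} cluster stop --all; then
        echo "cluster stop --all failed; shutting down anyway" >&2
    fi
    ${SHUTDOWN} -h +0
}
```

A hook like this would be called from NUT's notification mechanism (e.g. upssched) when the critical battery event fires.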
> >>>> Possibly with 'pcs cluster standby' first if you're worried that stopping the resources might take too long.
> >
> > I forgot to ask: in which way would a previous standby make the resources stop sooner?
>
> It won't. But it will let the resources take as long as they need before the node says "you took too long, i'm outta here".
> It all depends on the shutdown-escalation timeout.
>
> >
> >>> I thought that "pcs cluster standby" would usually migrate the resources to the other node (I actually tried it and confirmed the expected behaviour); so this would risk becoming a race with the timing of the other node's standby,
> >>
> >> Not really, at the point the second node runs 'standby' we'll stop trying to migrate services and just stop them everywhere.
> >> Again, this is a centrally controlled process, timing isn't a problem.
> >
> > I understand that, "eventually", timing won't be a problem and resources will "eventually" stop, but from your description I'm afraid that some delay could creep into the total shutdown process, arising from possibly unsynchronized UPS notifications on the nodes (the first node starts standby, resources start to move, THEN the second node starts standby).
>
> Worst case, a start action starts just before the second node goes into standby.
> The maximum delay is then the length of time that action could take.
>
> Is it long? Is it significant? I can't say :)
It all seems really quick in my tests, so I would say "not long and not significant". Still, a clean-looking, general solution should try to minimize unnecessary delays/actions, provided that doing so does not bring negative consequences or too much additional complexity (or so I think).
> > So now I'm taking your advice and I'll modify the script to use cluster stop but, with the aim of avoiding the aforementioned delay (if it actually represents a possibility), I would like to ask you three questions:
> >
> > *) if I simply issue a "pcs cluster stop --all" from the first node that gets notified of UPS critical status, do I risk any adverse effect when the other node asynchronously gives the same command some time later (before/after the whole cluster stop sequence completes)?
>
> I don't believe so.
Ok, many thanks.
This was confirmed by Tomas Jelinek too.
> > *) does the aforementioned "pcs cluster stop --all" command return only after the cluster stop sequence has actually/completely ended (so as to safely issue a "shutdown -h now" immediately afterwards)?
>
> Chris would know for sure.
Tomas Jelinek explained that, summing it up, it should be "safe and clean": errors are reported only if the other node fails to stop pacemaker/cman or does not respond at all (either completely off or simply with pcsd stopped). I still have a small doubt, though, about one node issuing "pcs cluster stop --all" while the other does the same before the actions initiated by the first have completed: thinking of the single actions, I don't know whether pacemaker/cman simply mask/halt further stop requests while one is already running.
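On the local side at least, the re-entrancy part of that doubt can be sidestepped: here is a small sketch that serializes the handler with flock(1) from util-linux, so a second notification (or a retry) cannot re-enter "pcs cluster stop --all" on this node while a stop is already in flight. The lock file path is an assumption, not something from the thread:

```shell
# Sketch: run the wrapped command at most once at a time on this node.
LOCKFILE="${LOCKFILE:-/var/run/ups-cluster-stop.lock}"

run_once() {
    # fd 9 holds an exclusive lock for the duration of the wrapped
    # command; if the lock is already taken, the invocation is skipped.
    (
        flock -n 9 || exit 0
        "$@"
    ) 9>"${LOCKFILE}"
}
```

A hypothetical integration would be `run_once pcs cluster stop --all` from the NUT notify hook; the peer's concurrent "stop --all" is still handled cluster-side, as discussed above.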
> > *) is the "pcs cluster stop --all" command known to work reliably on current CentOS 6.5? (I ask since I found some discussion around "pcs cluster start" related bugs)
>
> I think --all won't work on 6.5 (since pcsd isn't around (yet))
This too was pointed out by Tomas Jelinek. I'm a bit wary of compiling an updated pacemaker RPM myself, so I would rather wait for RHEL/CentOS 6.6, which should bring updated versions (judging from history and from various Bugzilla discussions).
In the meantime I plan to either refine my original script (but it looks uglier each time I look at it...) or, better I think, put the whole cluster in maintenance mode (an action that should be repeatable and timing-insensitive) and delegate the actual resource stopping to the ordinary shutdown runlevel actions on each node (drbd, libvirt-guests, etc.). Any comments on this course of action (even a single "don't ever do that" ;> )?
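For clarity, the maintenance-mode idea above would boil down to something like this (a sketch only; PCS/SHUTDOWN are overridable purely for dry-run testing):

```shell
# Sketch: with maintenance-mode set, Pacemaker leaves all resources
# unmanaged, so the ordinary shutdown runlevel scripts (drbd,
# libvirt-guests, ...) take them down during the OS halt.
: "${PCS:=pcs}"
: "${SHUTDOWN:=/sbin/shutdown}"

maintenance_and_halt() {
    # setting a cluster property is idempotent, hence repeatable
    # and timing-insensitive across the two nodes
    ${PCS} property set maintenance-mode=true || return 1
    ${SHUTDOWN} -h +0
}
```

After recovery, a manual `pcs property set maintenance-mode=false` would hand the resources back to Pacemaker.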
BTW: while performing some tests I noticed that, unlike "pcs cluster stop --all", the "pcs cluster unstandby --all" action does not need pcsd (it worked with the CentOS 6.5 Pacemaker stack). So another idea pops up: use "pcs cluster standby --all" and then shut down; it is available now and quite "similar" for my purposes. A single manual "unstandby --all" should revert it afterwards (this differs from the "stop" case, which would not need further manual actions on startup).
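The standby-based fallback just described can be sketched the same way (again, the command variables exist only for dry-run testing; the pcs subcommands are the ones from this thread):

```shell
# Sketch: standby --all works without pcsd on CentOS 6.5, so each
# node's init scripts then stop pacemaker/cman during the normal
# OS shutdown.
: "${PCS:=pcs}"
: "${SHUTDOWN:=/sbin/shutdown}"

standby_and_halt() {
    ${PCS} cluster standby --all || return 1   # resources stop everywhere
    ${SHUTDOWN} -h +0
}

# after power returns and both nodes have rejoined, revert manually:
#   pcs cluster unstandby --all
```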
Many thanks again.
Regards,
Giuseppe
> > Many thanks again for your invaluable help and insight.
> >
> > Regards,
> > Giuseppe
> >
> >>> so this is why I took the hassle of explicitly and orderly stopping all locally-running resources in my script BEFORE putting the local node in standby.
> >>>
> >>>> Pacemaker will stop everything in the required order and stop the node when done... problem solved?
> >>>
> >>> I thought that after a "pcs cluster standby" a regular "shutdown -h" of the operating system would cleanly bring down the cluster too,
> >>
> >> It should do
> >>
> >>> without the need for a "pcs cluster stop", given that both Pacemaker and CMAN are correctly configured for automatic startup/shutdown as operating system services (SysV initscripts controlled by CentOS 6.5 Upstart, in my case).
> >>>
> >>> Many thanks again for your always thought-provoking and informative answers!
> >>>
> >>> Regards,
> >>> Giuseppe
> >>>
> >>>>>
> >>>>> On further startup, manual intervention would be required to unstandby all nodes and enable resources (nodes already in standby and resources already disabled before blackout should be manually distinguished).
> >>>>>
> >>>>> Is this strategy conceptually safe?
> >>>>>
> >>>>> Unfortunately, various searches have turned out no "prior art" :)
> >>>>>
> >>>>> This is my tentative script (consider it in the public domain):
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------------------
> >>>>> #!/bin/bash
> >>>>>
> >>>>> # Note: "pcs cluster status" still has a small bug vs. CMAN-controlled Corosync and would always return != 0
> >>>>> pcs status > /dev/null 2>&1
> >>>>> STATUS=$?
> >>>>>
> >>>>> # Detect if cluster is running at all on local node
> >>>>> # TODO: detect node already in standby and bypass this
> >>>>> if [ "${STATUS}" = 0 ]; then
> >>>>> local_node="$(cman_tool status | grep -i 'Node[[:space:]]*name:' | sed -e 's/^.*Node\s*name:\s*\([^[:space:]]*\).*$/\1/i')"
> >>>>> for local_resource in $(pcs status 2>/dev/null | grep "ocf::heartbeat:VirtualDomain.*${local_node}\\s*\$" | awk '{print $1}'); do
> >>>>> pcs resource disable "${local_resource}"
> >>>>> done
> >>>>> # TODO: each resource disabling above may return without waiting for complete stop - wait here for "no more resources active"? (but avoid endless loops)
> >>>>> pcs cluster standby "${local_node}"
> >>>>> fi
> >>>>>
> >>>>> # Shut down gracefully anyway at the end
> >>>>> /sbin/shutdown -h +0
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>> Comments/suggestions/improvements are more than welcome.
> >>>>>
> >>>>> Many thanks in advance.
> >>>>>
> >>>>> Regards,
> >>>>> Giuseppe
> >>>>>
> >>>>> _______________________________________________
> >>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>
> >>>>> Project Home: http://www.clusterlabs.org
> >>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>> Bugs: http://bugs.clusterlabs.org
> >>>>
> >>> --
> >>> Giuseppe Ragusa
> >>> giuseppe.ragusa at fastmail.fm
> >>>
> >>>
> >>
> > --
> > Giuseppe Ragusa
> > giuseppe.ragusa at fastmail.fm
> >
> >
>
--
Giuseppe Ragusa
giuseppe.ragusa at fastmail.fm