[Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)
Digimer
lists at alteeve.ca
Fri Jul 4 14:17:07 UTC 2014
On 04/07/14 02:16 PM, Giuseppe Ragusa wrote:
> Hi all,
> I'm trying to create a script as per subject (on CentOS 6.5,
> CMAN+Pacemaker, only DRBD+KVM active/passive resources; SNMP-UPS
> monitored by NUT).
>
> Ideally I think that each node should stop (disable) all locally-running
> VirtualDomain resources (doing so cleanly demotes then downs the DRBD
> resources underneath), then put itself in standby and finally shut down.
>
> On further startup, manual intervention would be required to unstandby
> all nodes and enable resources (nodes already in standby and resources
> already disabled before blackout should be manually distinguished).
>
> Is this strategy conceptually safe?
>
> Unfortunately, various searches have turned up no "prior art" :)
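
To make the proposed sequence concrete, here is a minimal dry-run sketch in Python. It only builds and prints the ordered command list (disable each local VM resource, put the node in standby, shut down) and executes nothing. The use of pcs, the exact subcommands, the node name and the resource names are all assumptions for illustration; the right commands depend on the tooling and version installed.

```python
def build_shutdown_plan(node, local_vm_resources):
    """Return the ordered commands for the proposed sequence.

    Nothing is executed here; the caller decides what to do with the plan.
    The pcs subcommands are an assumption and may differ between versions.
    """
    plan = []
    # 1. Disable every VirtualDomain resource currently running on this node.
    for resource in local_vm_resources:
        plan.append(["pcs", "resource", "disable", resource])
    # 2. Put the node itself in standby.
    plan.append(["pcs", "cluster", "standby", node])
    # 3. Finally power the node off.
    plan.append(["shutdown", "-h", "now"])
    return plan


if __name__ == "__main__":
    # Hypothetical node and resource names, printed for review only.
    for command in build_shutdown_plan("node1", ["vm-web", "vm-db"]):
        print(" ".join(command))
```

Keeping this as a printed plan rather than something that runs on its own leaves the decision with a human.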

I started work on something similar with apcupsd (first I had to make it
work with multiple UPSes, which I did). Then I decided not to actually
implement it, and instead to leave it up to an admin to decide
how/when/if to initiate a graceful shutdown.

My rationale was that this placed way too much potential damage in the
hands of, effectively, a single trigger. One bad bug and you could bring
down a perfectly fine cluster.

Instead, what I did was ensure that any power event triggered an alert
email (x2, as both nodes ran the monitoring app). This way, I (and the
client's admins) would be notified immediately if anything happened.
Then it was up to us to decide how/if to initiate a graceful shutdown.

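As a rough illustration of the alert-only approach, here is a minimal Python sketch of a handler that turns a power event into an email and takes no other action. It is not the script I actually ran; the event name, UPS name, addresses and the local SMTP relay are all hypothetical, and it assumes the UPS monitor can call a script with the event details.

```python
import smtplib
import socket
from email.message import EmailMessage


def build_power_alert(event, ups_name, recipients,
                      sender="cluster-alerts@example.com"):
    """Build an alert email for a UPS power event (addresses are hypothetical)."""
    host = socket.gethostname()
    msg = EmailMessage()
    msg["Subject"] = f"[UPS] {event} on {ups_name} (reported by {host})"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        f"Node {host} saw power event '{event}' on UPS '{ups_name}'.\n"
        "No automatic action was taken. An admin must decide whether and "
        "how to start a graceful shutdown.\n"
    )
    return msg


def send_alert(msg, smtp_host="localhost"):
    """Send the alert through an SMTP relay (assumed to listen on localhost)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Hypothetical event; printed instead of sent so it is safe to run.
    print(build_power_alert("ONBATT", "ups1", ["admin@example.com"]))
```

Running this on both nodes gives the duplicate notification described above, which doubles as a check that each node's monitoring is alive.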
One real-world example:

A couple of months ago, a client's neighborhood was hit with a prolonged
power outage. Eventually, we decided to gracefully shut down. However,
one of the Windows VMs had downloaded and prepped to install about 30
updates (no idea how this happened, except Windows). Anyway, the VM took
more time to shut down than the batteries could support. So half-way
through, we withdrew one node and powered it off to shed load and gain
battery runtime. This kind of logic cannot reasonably be coded into a
script.

My $0.02.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?