[Pacemaker] Best way to recover from failed STONITH?

Fri Dec 21 16:11:08 UTC 2012

On 12/21/2012 04:18 PM, Andrew Martin wrote:
> Hello,
> 
> Yesterday a power failure took out one of the nodes and its STONITH device (they share an upstream power source) in a 3-node active/passive cluster (Corosync 2.1.0, Pacemaker 1.1.8). After logging into the cluster, I saw that the STONITH operation had given up in failure and that none of the resources were running on the other nodes:
> Dec 20 17:59:14 [18909] quorumnode       crmd:   notice: too_many_st_failures:       Too many failures to fence node0 (11), giving up
> 
> I brought the failed node back online and it rejoined the cluster, but no more STONITH attempts were made and the resources remained stopped. Eventually I set stonith-enabled="false", ran killall on all pacemaker-related processes on the other (remaining) nodes, then restarted pacemaker, and the resources successfully migrated to one of the other nodes. This seems like a rather invasive technique. My questions about this type of situation are:
>  - is there a better way to tell the cluster "I have manually confirmed this node is dead/safe"? I see there is the meatclient command, but can that only be used with the meatware STONITH plugin?

crm node cleanup quorumnode
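
For reference, Pacemaker also ships stonith_admin, whose --confirm option tells the cluster that a node is safely down. The following is only a minimal sketch, assuming the dead node is node0 (the node named in the log line above) and that you have checked by hand that it really is powered off. The confirm_fenced wrapper and the DRY_RUN variable are made up for illustration and are not part of Pacemaker; check stonith_admin --help on your version before relying on the option.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker): acknowledge a node that was
# verified by hand to be down, so the cluster stops waiting for fencing.
# DRY_RUN defaults to 1 and only prints the command; set DRY_RUN=0 to run it.
confirm_fenced() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: confirm_fenced <node>" >&2
        return 1
    fi
    # stonith_admin --confirm tells the cluster the named node is safely down.
    # Confirming a node that is still running risks resources being active
    # in two places at once, so verify first.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "stonith_admin --confirm $node"
    else
        stonith_admin --confirm "$node"
    fi
}

# node0 is the failed node from the log excerpt; adjust to your node name.
confirm_fenced node0
```

By default this only prints the command. Run it with DRY_RUN=0 on one of the surviving nodes once you are sure the node is off.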

>  - in general, is there a way to force the cluster to start resources, if you just need to get them back online and as a human have confirmed that things are okay? Something like crm resource start rsc --force?

... see above ;-)

>  - how can I completely clear out saved data for the cluster and start over from scratch (last-resort option)? Stopping pacemaker and removing everything from /var/lib/pacemaker/cib and /var/lib/pacemaker/pengine cleans the CIB, but the nodes end up sitting in the "pending" state for a very long time (30 minutes or more). Am I missing another directory that needs to be cleared?

You started with a completely empty CIB and the two (or three?) nodes
needed 30 minutes to form a cluster?

> 
> I am going to look into making the power source for the STONITH device independent of the power source for the node itself. However, even with that setup there's still a chance that something could take out both power sources at the same time, in which case manual intervention and confirmation that the node is dead would be required.

Pacemaker 1.1.8 supports (again) STONITH topologies, so you can configure
more than one fencing device per node and combine them "logically".
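
As a rough illustration of what such a topology could look like in the crm shell, assuming your crmsh version has the fencing_topology command and that two STONITH resources already exist (st-ipmi-node0 and st-pdu-node0 are made-up names, not taken from your configuration):

```shell
# Hypothetical device names. Space-separated devices are separate levels,
# tried in order: first st-ipmi-node0, then st-pdu-node0 if that fails.
crm configure fencing_topology node0: st-ipmi-node0 st-pdu-node0
```

Devices joined with a comma are placed in the same level, so all of them must succeed for that level to count as a successful fence.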

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Thanks,
> 
> Andrew
> 
