[Pacemaker] Node lost early in HA startup --> no STONITH

Mon Aug 3 15:14:47 UTC 2015

Sounds similar to the issue I described here last week.  We also had two nodes, and lost the network connection between them while one was starting up after a fence.  Although we had stonith resources configured, those resources were never called, and the cluster was considered active on both nodes throughout the network split.  We were able to reproduce this issue in our lab.  There seems to be a window during corosync startup in which a node that joins the cluster and then leaves before the Pacemaker stonith resources have started will not be fenced.  This issue may be isolated to two-node systems: normally a single node that is separated from the cluster loses quorum, which is not the case with two_node.

Are you running with "two_node" in corosync.conf?
Are you running with "wait_for_all"? (It's on by default with "two_node")
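
For anyone checking their own setup, here is a rough Python sketch that reports the effective values of these two options.  It only applies to corosync 2.x stacks using the votequorum provider.  The sample text and the helper name quorum_options are illustrative assumptions, not taken from anyone's actual cluster.  It scans the whole file rather than just the quorum section, which is a simplification.

```python
import re

# Illustrative quorum section (assumption: corosync 2.x votequorum syntax).
SAMPLE_CONF = """
quorum {
    provider: corosync_votequorum
    two_node: 1
}
"""

def quorum_options(conf_text):
    """Return the effective two_node / wait_for_all values from corosync.conf text."""
    opts = {}
    for key in ("two_node", "wait_for_all"):
        # Match lines like "two_node: 1"; commented-out lines do not match.
        m = re.search(r"^\s*%s\s*:\s*(\d+)\s*$" % key, conf_text, re.MULTILINE)
        if m:
            opts[key] = int(m.group(1))
    two_node = opts.get("two_node", 0)
    # wait_for_all follows two_node unless it is set explicitly.
    wait_for_all = opts.get("wait_for_all", two_node)
    return {"two_node": two_node, "wait_for_all": wait_for_all}

print(quorum_options(SAMPLE_CONF))
```

With the sample above, both options come out enabled; adding an explicit "wait_for_all: 0" line would override the default.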

________________________________
From: Chris Walker [christopher.walker at gmail.com]
Sent: Sunday, August 02, 2015 23:02
To: pacemaker at oss.clusterlabs.org
Subject: [Pacemaker] Node lost early in HA startup --> no STONITH

Hello,

We recently had an unfortunate sequence on our two-node cluster (nodes n02 and n03) that can be summarized as:
1.  n03 became pathologically busy and was STONITHed by n02
2.  The heavy load migrated to n02, which also became pathologically busy
3.  n03 was rebooted
4.  During the startup of HA on n03, n02 was initially seen by n03:

Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is now online

5.  But later during the startup sequence (after DC election and CIB sync) we see n02 die (n02 is really wrapped around the axle, many stuck threads, etc.):

Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
...
Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02 is now lost (was member)

Our deadtime is 240 seconds, so n02 must have become unresponsive almost immediately after n03 reported it online at 15:23:43.
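
The timing can be checked with a short Python sketch.  It assumes the year is 2015 (the log lines carry no year) and that heartbeat declares a node dead roughly one deadtime after the last heartbeat it received; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps from the n03 log lines quoted above (year assumed to be 2015).
seen_online = datetime(2015, 7, 26, 15, 23, 43)    # crmd: n02.ais is now online
declared_dead = datetime(2015, 7, 26, 15, 27, 44)  # heartbeat: node n02: is dead
deadtime = timedelta(seconds=240)

# If a node is declared dead one deadtime after its last heartbeat,
# the last heartbeat from n02 arrived at about this time:
last_heard = declared_dead - deadtime
gap = (last_heard - seen_online).total_seconds()

print(last_heard.strftime("%H:%M:%S"), gap)  # prints: 15:23:44 1.0
```

Under those assumptions, n02 went silent about one second after n03 first saw it online.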

6.  The troubling aspect of this incident is that even though there are multiple STONITH resources configured for n03, none of them was engaged, and n03 then mounted filesystems that were also active on n02.

I'm wondering whether the fact that no STONITH resources were started by this time explains why n02 was not STONITHed.  Shortly after n02 is declared dead we see STONITH resources begin starting, e.g.,

Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start   n03-3-ipmi-stonith (n03)

Does the fact that there were no active STONITH resources when n02 was declared dead explain why no STONITH action was taken against this node?  Is there a fix/workaround for this scenario (we're using heartbeat 3.0.5 and pacemaker 3.1.6 (RHEL6.2))?

Thanks very much!
Chris