[Pacemaker] Node lost early in HA startup --> no STONITH

emmanuel segura emi2fast at gmail.com
Mon Aug 3 11:27:33 EDT 2015


From what I see, he is using heartbeat.

2015-08-03 17:14 GMT+02:00 Thomas Meagher <thomas.meagher at hds.com>:
>
> Sounds similar to the issue I described here last week.  We also had two
> nodes, and lost the network connection between them while one was starting
> up after a fence.  Although we had stonith resources configured, those
> resources were never called, and the cluster was considered active on both
> nodes throughout the network split.  We were able to reproduce this issue
> in our lab; it seems there is a window during corosync startup where, if a
> node joins the cluster and then leaves before Pacemaker stonith resources
> have started, it will not be fenced.  This issue may be isolated to
> two-node systems, as normally a single node that is separated from the
> cluster will have lost quorum, which is not the case with two_node.
>
> Are you running with "two_node" in corosync.conf?
> Are you running with "wait_for_all"? (It's on by default with "two_node")
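>
> For reference, those options live in the quorum section of corosync.conf;
> a minimal sketch of the corosync 2.x votequorum syntax, not anyone's
> actual config:
>
>     quorum {
>         provider: corosync_votequorum
>         two_node: 1
>         wait_for_all: 1
>     }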
>
> ________________________________
> From: Chris Walker [christopher.walker at gmail.com]
> Sent: Sunday, August 02, 2015 23:02
> To: pacemaker at oss.clusterlabs.org
> Subject: [Pacemaker] Node lost early in HA startup --> no STONITH
>
> Hello,
>
> We recently had an unfortunate sequence on our two-node cluster (nodes n02
> and n03) that can be summarized as:
> 1.  n03 became pathologically busy and was STONITHed by n02
> 2.  The heavy load migrated to n02, which also became pathologically busy
> 3.  n03 was rebooted
> 4.  During the startup of HA on n03, n02 was initially seen by n03:
>
> Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is
> now online
>
> 5.  But later during the startup sequence (after DC election and CIB sync)
> we see n02 die (n02 is really wrapped around the axle: many stuck threads,
> etc.):
>
> Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
> ...
> Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02
> is now lost (was member)
>
> Our deadtime is 240 seconds, so n02 must have become unresponsive almost
> immediately after n03 reported it up at 15:23:43 (15:27:44 minus the
> 240-second deadtime puts the last sign of life at about 15:23:44).
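>
> For reference, deadtime is set in heartbeat's ha.cf; a minimal sketch in
> which only the 240-second deadtime is ours and the other values are
> illustrative:
>
>     keepalive 2
>     warntime 60
>     deadtime 240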
>
> 6.  The troubling aspect of this incident is that even though there are
> multiple STONITH resources configured for n03, none of them was engaged,
> and n03 then mounted filesystems that were also active on n02.
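>
> For context, one of these STONITH resources in crm shell syntax would look
> roughly like the sketch below (the device parameters are elided, and
> external/ipmi is just one of the IPMI plugins shipped with cluster-glue):
>
>     primitive n03-3-ipmi-stonith stonith:external/ipmi \
>         params hostname="..." ipaddr="..." userid="..." passwd="..." \
>         op monitor interval="60s"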
>
> I'm wondering whether the fact that no STONITH resources had started by
> this time explains why n02 was not STONITHed.  Shortly after n02 is
> declared dead, we see STONITH resources begin starting, e.g.:
>
> Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start
> n03-3-ipmi-stonith (n03)
>
> Is it the case that, because there were no active STONITH resources when
> n02 was declared dead, no STONITH action was taken against it?  Is there
> a fix/workaround for this scenario?  (We're using heartbeat 3.0.5 and
> pacemaker 1.1.6 on RHEL 6.2.)
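>
> For reference, the cluster properties that govern this behavior can be set
> from the crm shell; a minimal sketch, assuming the pacemaker 1.1.x crm
> shell, where startup-fencing (true by default) controls whether nodes the
> cluster has not yet seen get fenced:
>
>     crm configure property stonith-enabled="true"
>     crm configure property startup-fencing="true"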
>
> Thanks very much!
> Chris
>
>






