[Pacemaker] Node lost early in HA startup --> no STONITH

Mon Aug 3 11:38:04 EDT 2015

I saw Thomas's post from last week and it sounded very similar to what we
saw, but I wasn't sure whether the heartbeat/corosync difference made this a
different issue.  I'm trying to reproduce the problem and assemble the
log/config info.
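
In the meantime, here is a minimal sketch of the timing check behind the
deadtime remark in my original report (quoted below).  It assumes heartbeat
declares a peer dead once nothing has been heard from it for `deadtime`
seconds, and that the log timestamps are from 2015 (the log lines carry no
year).  The variable names are only for illustration.

```python
from datetime import datetime, timedelta

# Timestamps copied from the n03 log lines quoted below.
# Assumption: the year is 2015; the log lines themselves carry no year.
seen_online = datetime(2015, 7, 26, 15, 23, 43)    # crmd: n02.ais is now online
declared_dead = datetime(2015, 7, 26, 15, 27, 44)  # heartbeat: node n02: is dead

# deadtime as stated in the report.
deadtime = timedelta(seconds=240)

# Assumption: a node is declared dead once it has been silent for `deadtime`,
# so the last heartbeat from n02 arrived roughly `deadtime` before that point.
last_heard = declared_dead - deadtime
gap = last_heard - seen_online

print(f"last heard from n02 at about {last_heard:%H:%M:%S}")
print(f"gap after n02 was first seen online: {gap.total_seconds():.0f} s")
```

Under those assumptions, n02 went silent within about a second of n03 first
seeing it online.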

Thanks again,
Chris

On Mon, Aug 3, 2015 at 11:27 AM, emmanuel segura <emi2fast at gmail.com> wrote:

> From what I see, he is using heartbeat.
>
> 2015-08-03 17:14 GMT+02:00 Thomas Meagher <thomas.meagher at hds.com>:
> >
> > Sounds similar to the issue I described here last week.  We also had two
> > nodes, and lost network connection between the two nodes while one was
> > starting up after a fence.  Although we had stonith resources configured,
> > those resources were never called, and the cluster was considered active
> > on both nodes throughout the network split.  We were able to reproduce
> > this issue in our lab; it seems there is a window during corosync startup
> > where, if a node joins the cluster and then leaves before Pacemaker
> > stonith resources have started, it will not be fenced.  This issue may be
> > isolated to two-node systems, as normally a single node that is separated
> > from the cluster will have lost quorum, which is not the case with
> > two_node.
> >
> > Are you running with "two_node" in corosync.conf?
> > Are you running with "wait_for_all"? (It's on by default with "two_node")
> >
> > ________________________________
> > From: Chris Walker [christopher.walker at gmail.com]
> > Sent: Sunday, August 02, 2015 23:02
> > To: pacemaker at oss.clusterlabs.org
> > Subject: [Pacemaker] Node lost early in HA startup --> no STONITH
> >
> > Hello,
> >
> > We recently had an unfortunate sequence on our two-node cluster (nodes
> > n02 and n03) that can be summarized as:
> > 1.  n03 became pathologically busy and was STONITHed by n02
> > 2.  The heavy load migrated to n02, which also became pathologically busy
> > 3.  n03 was rebooted
> > 4.  During the startup of HA on n03, n02 was initially seen by n03:
> >
> > Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is now online
> >
> > 5.  But later during the startup sequence (after DC election and CIB
> > sync) we see n02 die (n02 is really wrapped around the axle, many stuck
> > threads, etc)
> >
> > Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
> > ...
> > Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02 is now lost (was member)
> >
> > Our deadtime is 240 seconds, so n02 must have become unresponsive at
> > about 15:23:44, almost immediately after n03 reported it up at 15:23:43.
> >
> > 6.  The troubling aspect of this incident is that even though there are
> > multiple STONITH resources configured for n03, none of them was engaged
> > and n03 then mounted filesystems that were also active on n02.
> >
> > I'm wondering whether the fact that no STONITH resources were started by
> > this time explains why n02 was not STONITHed.  Shortly after n02 is
> > declared dead we see STONITH resources begin starting, e.g.,
> >
> > Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start n03-3-ipmi-stonith (n03)
> >
> > Is it the case that, because there were no active STONITH resources when
> > n02 was declared dead, no STONITH action was taken against this node?  Is
> > there a fix/workaround for this scenario (we're using heartbeat 3.0.5 and
> > pacemaker 1.1.6 (RHEL6.2))?
> >
> > Thanks very much!
> > Chris
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >