[ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

Mon Nov 27 22:54:06 CET 2017

On Mon, 2017-11-13 at 10:24 -0500, Derek Wuelfrath wrote:
> Hello Ken !
> 
> > Make sure that the systemd service is not enabled. If pacemaker is
> > managing a service, systemd can't also be trying to start and stop
> > it.
> 
> It is not. I made sure of this in the first place :)
> 
> > Beyond that, the question is what log messages are there from
> > around
> > the time of the issue (on both nodes).
> 
> Well, that’s the thing. There are not many log messages telling what
> is actually happening. The ‘systemd’ resource is not even trying to
> start (nothing in either log for that resource). Here are the logs
> from my last attempt:
> Scenario:
> - Services were running on ‘pancakeFence2’. DRBD was synced and
> connected
> - I rebooted ‘pancakeFence2’. Services failed over to ‘pancakeFence1’
> - After ‘pancakeFence2’ comes back, services are running just fine on
> ‘pancakeFence1’ but DRBD is in StandAlone due to split-brain
> 
> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
> Logs for pancakeFence2: https://pastebin.com/at8qPkHE

When you say you rebooted the node, was it a clean reboot or a
simulated failure like power-off or kernel-panic? If it was a simulated
failure, then the behavior makes sense in this case. If a node
disappears for no known reason, DRBD ends up in split-brain. If fencing
were configured, the surviving node would fence the other one to be
sure it's down, but it might still be unable to reconnect to DRBD
without manual intervention.
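
To illustrate the StandAlone symptom discussed here, the following is a
minimal Python sketch that reads the per-device connection state (the
"cs:" field) out of /proc/drbd text. It assumes the DRBD 8.x /proc/drbd
layout; connection_states and standalone_devices are hypothetical helper
names, not part of DRBD or Pacemaker. It only detects the state; it does
not resolve a split-brain.

```python
import re

# Assumption: DRBD 8.x prints one status line per device in /proc/drbd,
# starting with "<minor>: cs:<ConnectionState> ...".
CS_FIELD = re.compile(r"^\s*(\d+):\s+cs:(\S+)", re.MULTILINE)


def connection_states(proc_drbd_text):
    """Return {minor_number: connection_state} parsed from /proc/drbd text."""
    return {int(minor): state for minor, state in CS_FIELD.findall(proc_drbd_text)}


def standalone_devices(proc_drbd_text):
    """Return the sorted minor numbers whose connection state is StandAlone."""
    states = connection_states(proc_drbd_text)
    return sorted(minor for minor, state in states.items() if state == "StandAlone")
```

Passing the contents of /proc/drbd from each node to standalone_devices
would list the devices that have dropped their replication link.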

The systemd issue is separate, and I can't think of what would cause
it. If you have PCMK_logfile set in /etc/sysconfig/pacemaker, you will
get more extensive log messages there. The node elected DC will have
more "pengine:" messages than the other; those messages show all the
decisions made about what actions to take, and the results of those
actions.
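
The comparison described above (the DC logs more "pengine:" messages
than the other node) can be sketched as a small Python helper. It
assumes the log lines have already been read from whatever file
PCMK_logfile points to on each node; count_marker_lines and likely_dc
are hypothetical names, and the heuristic is only the one stated above,
not an authoritative way to identify the DC.

```python
def count_marker_lines(log_lines, marker="pengine:"):
    """Count the lines that contain the marker substring."""
    return sum(1 for line in log_lines if marker in line)


def likely_dc(logs_by_node, marker="pengine:"):
    """Return the node with the most marker lines, or None if unclear.

    logs_by_node maps a node name to an iterable of its log lines.
    None is returned when no node has any marker line or when there is a tie.
    """
    counts = {
        node: count_marker_lines(lines, marker)
        for node, lines in logs_by_node.items()
    }
    if not counts:
        return None
    best = max(counts.values())
    winners = [node for node, n in counts.items() if n == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]
```

The node returned by likely_dc is the one whose log is worth reading
first for the scheduling decisions around the time of the issue.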

> It really looks like the status check mechanism of
> corosync/pacemaker for a systemd resource forces the resource to
> “start” and therefore starts the ones above that resource in the
> group (DRBD, for instance).
> This does not happen for a regular OCF resource (IPaddr2, for example).
> 
> Cheers!
> -dw
> 
> --
> Derek Wuelfrath
> dwuelfrath at inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153
> (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence
> (www.packetfence.org) and Fingerbank (www.fingerbank.org)
> 
> > On Nov 10, 2017, at 11:39, Ken Gaillot <kgaillot at redhat.com> wrote:
> > 
> > On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
> > > Hello there,
> > > 
> > > First post here, but I have been following for a while!
> > 
> > Welcome!
> > 
> > > Here’s my issue:
> > > we have been setting up and running this type of cluster for a
> > > while and never really encountered this kind of problem.
> > > 
> > > I recently set up a Corosync / Pacemaker / PCS cluster to manage
> > > DRBD along with various other resources. Some of these resources
> > > are systemd resources… this is the part where things are
> > > “breaking”.
> > > 
> > > Having a two-server cluster running only DRBD, or DRBD with an
> > > OCF IPaddr2 resource (cluster IP, for instance), works just fine.
> > > I can easily move from one node to the other without any issue.
> > > As soon as I add a systemd resource to the resource group, things
> > > break. Moving from one node to the other using standby mode works
> > > just fine, but as soon as a Corosync / Pacemaker restart involves
> > > polling of a systemd resource, it seems like it is trying to start
> > > the whole resource group and therefore creates a split-brain of
> > > the DRBD resource.
> > 
> > My first two suggestions would be:
> > 
> > Make sure that the systemd service is not enabled. If pacemaker is
> > managing a service, systemd can't also be trying to start and stop
> > it.
> > 
> > Fencing is the only way pacemaker can resolve split-brains and
> > certain
> > other situations, so that will help in the recovery.
> > 
> > Beyond that, the question is what log messages are there from
> > around
> > the time of the issue (on both nodes).
> > 
> > 
> > > This is the best explanation / description of the situation that
> > > I can give. If it needs any clarification, examples, … I am more
> > > than happy to share them.
> > > 
> > > Any guidance would be appreciated :)
> > > 
> > > Here’s the output of ‘pcs config’:
> > > 
> > > https://pastebin.com/1TUvZ4X9
> > > 
> > > Cheers!
> > > -dw
> > > 
> > > --
> > > Derek Wuelfrath
> > > dwuelfrath at inverse.ca :: +1.514.447.4918 (x110) ::
> > > +1.866.353.6153
> > > (x110)
> > > Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence
> > > (www.packetfence.org) and Fingerbank (www.fingerbank.org)
> > -- 
> > Ken Gaillot <kgaillot at redhat.com>
> > 
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot <kgaillot at redhat.com>