[ClusterLabs] Floating IP failing over but not failing back with active/active LDAP (dirsrv)
Ken Gaillot
kgaillot at redhat.com
Thu Mar 10 15:00:58 UTC 2016
On 03/10/2016 08:48 AM, Bernie Jones wrote:
> A bit more info..
>
>
>
> If, after I restart the failed dirsrv instance, I then perform a "pcs
> resource cleanup dirsrv-daemon" to clear the FAIL messages, then the
> failover will work OK.
>
> So it's as if the cleanup is changing the status in some way..
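That would fit with migration-threshold=1: a single monitor failure
reaches the threshold and bans the resource from that node until the
failure is cleaned up (or a failure-timeout expires), so a stale failure
can block failback. The recorded failures can be checked with, for
example:

    pcs resource failcount show dirsrv-daemon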
>
>
>
> From: Bernie Jones [mailto:bernie at securityconsulting.ltd.uk]
> Sent: 10 March 2016 08:47
> To: 'Cluster Labs - All topics related to open-source clustering welcomed'
> Subject: [ClusterLabs] Floating IP failing over but not failing back with
> active/active LDAP (dirsrv)
>
>
>
> Hi all, could you advise please?
>
>
>
> I'm trying to configure a floating IP with an active/active deployment of
> 389 directory server. I don't want pacemaker to manage LDAP but just to
> monitor and switch the IP as required to provide resilience. I've seen some
> other similar threads and based my solution on those.
>
>
>
> I've amended the slapd OCF resource agent to work with 389 DS, and this
> tests out OK (dirsrv).
>
>
>
> I've then created my resources as below:
>
>
>
> pcs resource create dirsrv-ip ocf:heartbeat:IPaddr2 ip="192.168.26.100"
> cidr_netmask="32" op monitor timeout="20s" interval="5s" op start
> interval="0" timeout="20" op stop interval="0" timeout="20"
>
> pcs resource create dirsrv-daemon ocf:heartbeat:dirsrv op monitor
> interval="10" timeout="5" op start interval="0" timeout="5" op stop
> interval="0" timeout="5" meta "is-managed=false"
is-managed=false means the cluster will not try to start or stop the
service. It should never be used in regular production, only when doing
maintenance on the service.
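If it was set only for maintenance, the service can be handed back to
the cluster afterward with, for example:

    pcs resource manage dirsrv-daemon

(and pcs resource unmanage dirsrv-daemon to take it out of cluster
management again).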
> pcs resource clone dirsrv-daemon meta globally-unique="false"
> interleave="true" target-role="Started" "master-max=2"
>
> pcs constraint colocation add dirsrv-daemon-clone with dirsrv-ip
> score=INFINITY
This constraint means that dirsrv is only allowed to run where dirsrv-ip
is. I suspect you want the reverse, dirsrv-ip with dirsrv-daemon-clone,
which means keep the IP with a working dirsrv instance.
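Something like the following should do it (the id of the existing
constraint may differ; check pcs constraint list --full):

    pcs constraint remove colocation-dirsrv-daemon-clone-dirsrv-ip-INFINITY
    pcs constraint colocation add dirsrv-ip with dirsrv-daemon-clone score=INFINITY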
> pcs property set no-quorum-policy=ignore
If you're using corosync 2, you generally don't need or want this.
Instead, ensure corosync.conf has two_node: 1 (which will be done
automatically if you used pcs cluster setup).
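With corosync 2, the quorum section of corosync.conf would look
something like:

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }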
> pcs resource defaults migration-threshold=1
>
> pcs property set stonith-enabled=false
>
>
>
> On startup all looks well:
>
> ________________________________________________________________________________________
>
>
>
> Last updated: Thu Mar 10 08:28:03 2016
>
> Last change: Thu Mar 10 08:26:14 2016
>
> Stack: cman
>
> Current DC: ga2.idam.com - partition with quorum
>
> Version: 1.1.11-97629de
>
> 2 Nodes configured
>
> 3 Resources configured
>
>
>
>
>
> Online: [ ga1.idam.com ga2.idam.com ]
>
>
>
> dirsrv-ip (ocf::heartbeat:IPaddr2): Started ga1.idam.com
>
> Clone Set: dirsrv-daemon-clone [dirsrv-daemon]
>
> dirsrv-daemon (ocf::heartbeat:dirsrv): Started ga2.idam.com
> (unmanaged)
>
> dirsrv-daemon (ocf::heartbeat:dirsrv): Started ga1.idam.com
> (unmanaged)
>
>
>
>
>
> ________________________________________________________________________________________
>
>
>
> Stop dirsrv on ga1:
>
>
>
> Last updated: Thu Mar 10 08:28:43 2016
>
> Last change: Thu Mar 10 08:26:14 2016
>
> Stack: cman
>
> Current DC: ga2.idam.com - partition with quorum
>
> Version: 1.1.11-97629de
>
> 2 Nodes configured
>
> 3 Resources configured
>
>
>
>
>
> Online: [ ga1.idam.com ga2.idam.com ]
>
>
>
> dirsrv-ip (ocf::heartbeat:IPaddr2): Started ga2.idam.com
>
> Clone Set: dirsrv-daemon-clone [dirsrv-daemon]
>
> dirsrv-daemon (ocf::heartbeat:dirsrv): Started ga2.idam.com
> (unmanaged)
>
> dirsrv-daemon (ocf::heartbeat:dirsrv): FAILED ga1.idam.com
> (unmanaged)
>
>
>
> Failed actions:
>
> dirsrv-daemon_monitor_10000 on ga1.idam.com 'not running' (7): call=12,
> status=complete, last-rc-change='Thu Mar 10 08:28:41 2016', queued=0ms,
> exec=0ms
>
>
>
> IP fails over to ga2 OK:
>
>
>
> ________________________________________________________________________________________
>
>
>
> Restart dirsrv on ga1
>
>
>
> Last updated: Thu Mar 10 08:30:01 2016
>
> Last change: Thu Mar 10 08:26:14 2016
>
> Stack: cman
>
> Current DC: ga2.idam.com - partition with quorum
>
> Version: 1.1.11-97629de
>
> 2 Nodes configured
>
> 3 Resources configured
>
>
>
>
>
> Online: [ ga1.idam.com ga2.idam.com ]
>
>
>
> dirsrv-ip (ocf::heartbeat:IPaddr2): Started ga2.idam.com
>
> Clone Set: dirsrv-daemon-clone [dirsrv-daemon]
>
> dirsrv-daemon (ocf::heartbeat:dirsrv): Started ga2.idam.com
> (unmanaged)
>
> dirsrv-daemon (ocf::heartbeat:dirsrv): Started ga1.idam.com
> (unmanaged)
>
>
>
> Failed actions:
>
> dirsrv-daemon_monitor_10000 on ga1.idam.com 'not running' (7): call=12,
> status=complete, last-rc-change='Thu Mar 10 08:28:41 2016', queued=0ms,
> exec=0ms
>
>
>
> ________________________________________________________________________________________
>
>
>
> Stop dirsrv on ga2:
>
>
>
> Last updated: Thu Mar 10 08:31:14 2016
>
> Last change: Thu Mar 10 08:26:14 2016
>
> Stack: cman
>
> Current DC: ga2.idam.com - partition with quorum
>
> Version: 1.1.11-97629de
>
> 2 Nodes configured
>
> 3 Resources configured
>
>
>
>
>
> Online: [ ga1.idam.com ga2.idam.com ]
>
>
>
> dirsrv-ip (ocf::heartbeat:IPaddr2): Started ga2.idam.com
>
> Clone Set: dirsrv-daemon-clone [dirsrv-daemon]
>
> dirsrv-daemon (ocf::heartbeat:dirsrv): FAILED ga2.idam.com
> (unmanaged)
>
> dirsrv-daemon (ocf::heartbeat:dirsrv): Started ga1.idam.com
> (unmanaged)
>
>
>
> Failed actions:
>
> dirsrv-daemon_monitor_10000 on ga2.idam.com 'not running' (7): call=11,
> status=complete, last-rc-change='Thu Mar 10 08:31:12 2016', queued=0ms,
> exec=0ms
>
> dirsrv-daemon_monitor_10000 on ga1.idam.com 'not running' (7): call=12,
> status=complete, last-rc-change='Thu Mar 10 08:28:41 2016', queued=0ms,
> exec=0ms
>
>
>
> But the IP stays on the failed node.
>
> Looking in the logs, it seems the cluster is not aware that ga1 is
> available, even though the status output shows it is.
>
>
>
> If I repeat the tests but with ga2 started up first, the behaviour is
> similar, i.e. it fails over to ga1 but not back to ga2.
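When the IP stays on the failed node like this, it can help to look at
the allocation scores the policy engine is using, for example:

    crm_simulate -sL

which shows the per-node scores and any failure-based bans still in
effect.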
>
>
>
> Many thanks,
>
> Bernie