[Pacemaker] IP Range Failover with IPaddr2 and clone / globally-unique="true"
Jake Smith
jsmith at argotec.com
Wed Jan 25 15:45:59 UTC 2012
----- Original Message -----
> From: "Anton Melser" <melser.anton at gmail.com>
> To: "Jake Smith" <jsmith at argotec.com>, "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Wednesday, January 25, 2012 9:24:09 AM
> Subject: Re: [Pacemaker] IP Range Failover with IPaddr2 and clone / globally-unique="true"
>
> > Let's try that again with something useful!
> >
> > I'm not an expert on it but...
> >
> > unique_clone_address:
> > If true, add the clone ID to the supplied value of ip to create a
> > unique address to manage (optional, boolean, default false)
> >
> > So for example:
> > primitive ClusterIP ocf:heartbeat:IPaddr2 \
> > params ip="10.0.0.1" cidr_netmask="32" clusterip_hash="sourceip"
> > \
> > op monitor interval="30s"
> > clone CloneIP ClusterIP \
> > meta globally-unique="true" clone-max="8"
> >
> > would result in 8 IPs: 10.0.0.1, 10.0.0.2, ... up to 10.0.0.8.
>
> Ok, so I have reinstalled everything and have a clean setup. However,
> it still ain't working, unfortunately. Can you explain how I'm supposed
> to use unique_clone_address? This is mentioned at the start of the
> thread but not with the command. I tried doing what you suggest here:
>
> # primitive ClusterIP.144.1 ocf:heartbeat:IPaddr2 params
> ip="10.144.1.1" cidr_netmask="32" clusterip_hash="sourceip" op
> monitor
> interval="120s"
> # clone CloneIP ClusterIP.144.1 meta globally-unique="true"
> clone-max="8"
>
As Dejan said, I missed clone-node-max="8" (it defaults to 1 instance per node, so with two nodes only 2 instances of the clone were started).
I also missed something in the primitive which would have caused it to create only one IP no matter how many instances Pacemaker reported: you also have to set unique_clone_address="true" on the primitive. When I tested without it, crm_mon showed 8 instances started on the node, but ip address show only listed one IP. Here's the example I tested successfully (pinged all 8 without issue from the other node):
root@Condor:~# crm configure show p_testIPs
primitive p_testIPs ocf:heartbeat:IPaddr2 \
params ip="192.168.2.104" cidr_netmask="29" clusterip_hash="sourceip" nic="bond0" iflabel="testing" unique_clone_address="true" \
op monitor interval="60"
root@Condor:~# crm configure show cl_testIPs
clone cl_testIPs p_testIPs \
meta globally-unique="true" clone-node-max="8" clone-max="8" target-role="Started"
root@Condor:~# crm_mon
<snip>
Clone Set: cl_testIPs [p_testIPs] (unique)
p_testIPs:0 (ocf::heartbeat:IPaddr2): Started Vulture
p_testIPs:1 (ocf::heartbeat:IPaddr2): Started Vulture
p_testIPs:2 (ocf::heartbeat:IPaddr2): Started Vulture
p_testIPs:3 (ocf::heartbeat:IPaddr2): Started Vulture
p_testIPs:4 (ocf::heartbeat:IPaddr2): Started Vulture
p_testIPs:5 (ocf::heartbeat:IPaddr2): Started Vulture
p_testIPs:6 (ocf::heartbeat:IPaddr2): Started Vulture
p_testIPs:7 (ocf::heartbeat:IPaddr2): Started Vulture
root@Vulture:~# ip a s
<snip>
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 84:2b:2b:1a:bf:d6 brd ff:ff:ff:ff:ff:ff
inet 192.168.2.42/22 brd 192.168.3.255 scope global bond0
inet 192.168.2.104/29 brd 192.168.2.111 scope global bond0:testing
inet 192.168.2.105/29 brd 192.168.2.111 scope global secondary bond0:testing
inet 192.168.2.106/29 brd 192.168.2.111 scope global secondary bond0:testing
inet 192.168.2.107/29 brd 192.168.2.111 scope global secondary bond0:testing
inet 192.168.2.110/29 brd 192.168.2.111 scope global secondary bond0:testing
inet 192.168.2.111/29 brd 192.168.2.111 scope global secondary bond0:testing
inet 192.168.2.108/29 brd 192.168.2.111 scope global secondary bond0:testing
inet 192.168.2.109/29 brd 192.168.2.111 scope global secondary bond0:testing
inet6 fe80::862b:2bff:fe1a:bfd6/64 scope link
valid_lft forever preferred_lft forever
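For what it's worth, the "pinged all 8" check from the other node was nothing fancy; a quick loop along these lines does the job (just a sketch, the 104-111 range simply matches the addresses above):

root@Condor:~# for i in $(seq 104 111); do ping -c1 -W1 192.168.2.$i >/dev/null && echo "192.168.2.$i OK" || echo "192.168.2.$i FAILED"; done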
You may also want to look into the different options for clusterip_hash, nic, and the arp_* parameters, and think about what cidr_netmask to use for the IPaddr2 primitive.
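For example, here's roughly what a primitive with a few of those knobs set might look like (the values are only illustrative placeholders, not something I've tuned for your environment):

primitive p_exampleIP ocf:heartbeat:IPaddr2 \
    params ip="192.168.2.104" cidr_netmask="29" nic="bond0" iflabel="testing" \
        clusterip_hash="sourceip-sourceport" unique_clone_address="true" \
        arp_count="5" arp_interval="200" \
    op monitor interval="60"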
> That gave:
>
> [root@FW1 ~]# crm status
> ============
> Last updated: Wed Jan 25 13:57:51 2012
> Last change: Wed Jan 25 13:57:05 2012 via cibadmin on FW1
> Stack: openais
> Current DC: FW1 - partition with quorum
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
> 2 Nodes configured, 2 expected votes
> 8 Resources configured.
> ============
>
> Online: [ FW1 FW2 ]
>
> Clone Set: CloneIP.144.1 [ClusterIP.144.1] (unique)
> ClusterIP.144.1:0 (ocf::heartbeat:IPaddr2): Started FW1
> ClusterIP.144.1:1 (ocf::heartbeat:IPaddr2): Started FW2
> ClusterIP.144.1:2 (ocf::heartbeat:IPaddr2): Stopped
> ClusterIP.144.1:3 (ocf::heartbeat:IPaddr2): Stopped
> ClusterIP.144.1:4 (ocf::heartbeat:IPaddr2): Stopped
> ClusterIP.144.1:5 (ocf::heartbeat:IPaddr2): Stopped
> ClusterIP.144.1:6 (ocf::heartbeat:IPaddr2): Stopped
> ClusterIP.144.1:7 (ocf::heartbeat:IPaddr2): Stopped
>
> But none of the IPs were pingable after running the clone (just with
> the primitive it was ok).
> doing:
> crm(live)# configure property stop-all-resources=false
> Didn't get the other IPs "Started".
>
> So I got rid of this (successfully) and tried:
>
> primitive ClusterIP.144.1 ocf:heartbeat:IPaddr2 params
> ip="10.144.1.1"
> cidr_netmask="32" clusterip_hash="sourceip"
> unique_clone_address="true" op monitor interval="120s"
>
> But now I have:
>
> crm(live)# status
> ============
> Last updated: Wed Jan 25 14:57:42 2012
> Last change: Wed Jan 25 14:50:09 2012 via cibadmin on FW1
> Stack: openais
> Current DC: FW1 - partition with quorum
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ FW1 FW2 ]
>
> ClusterIP.144.1 (ocf::heartbeat:IPaddr2): Started FW1
> (unmanaged) FAILED
This shows the primitive is unmanaged - that means the stop-all-resources setting won't apply, because Pacemaker isn't managing the resource right now.
Try:
crm(live)# resource
crm(live)resource# manage ClusterIP.144.1
crm(live)resource# up
crm(live)# configure
crm(live)configure# property stop-all-resources=true
crm(live)configure# commit
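Once the resource is managed again and actually stopped, you should be able to remove it; something along these lines should work (standard crm shell, untested against your exact setup):
crm(live)# resource
crm(live)resource# stop ClusterIP.144.1
crm(live)resource# cleanup ClusterIP.144.1
crm(live)resource# up
crm(live)# configure
crm(live)configure# delete ClusterIP.144.1
crm(live)configure# commit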
>
> Failed actions:
> ClusterIP.144.1_stop_0 (node=FW1, call=25, rc=6,
> status=complete):
> not configured
>
> And I can't delete it:
> crm(live)# configure property stop-all-resources=true
> crm(live)# configure commit
> INFO: apparently there is nothing to commit
> INFO: try changing something first
> crm(live)# configure erase
> WARNING: resource ClusterIP.144.1 is running, can't delete it
> ERROR: CIB erase aborted (nothing was deleted)
>
> I can't work out how to move forward... Any pointers?
> Cheers
> Anton
>
>