[Pacemaker] question about interface failover
Florian Crouzat
gentoo at floriancrouzat.net
Tue May 21 12:04:29 UTC 2013
On 18/05/2013 20:23, christopher barry wrote:
> On Fri, 2013-05-17 at 10:41 +0200, Florian Crouzat wrote:
>> On 16/05/2013 21:45, christopher barry wrote:
>>> Greetings,
>>>
>>> I've set up a new 2-node mysql cluster using
>>> * drbd 8.3.1.3
>>> * corosync 1.4.2
>>> * pacemaker 1.1.7
>>> on Debian Wheezy nodes.
>>>
>>> failover seems to be working fine for everything except the ips manually
>>> configured on the interfaces.
>>
>> This sentence makes no sense to me.
>> The cluster will not fail over something that it does not manage (a
>> 'manually' configured IP...)
>>
>> What exactly are you trying to achieve?
>> Also, could you pastebin the output of "crm_mon -Arf1"? I find it
>> easier to read.
>>
>>
>>>
>>> see config here:
>>> http://pastebin.aquilenet.fr/?9eb51f6fb7d65fda#/YvSiYFocOzogAmPU9g+g09RcJvhHbgrY1JuN7D+gA4=
>>>
>>> If I bring down an interface, when the cluster restarts it, it only
>>> starts it with the vip - the original ip and route have been removed.
>>
>> Makes sense if you added the 'original' IP manually...
>> The non-VIP addresses should be configured in
>> /etc/sysconfig/network/ifcfg-*
>> But then again, please clarify what you are trying to achieve.
>>
>>>
>>> not sure what to do to make sure the permanent ip and the routes get
>>> restored. I'm not all that versed on the cluster commandline yet, and
>>> I'm using LCMC for most of my usage.
>>
>>
>
> (@howard2.rjmetrics.com)-(14:00 / Sat May 18)
> [-][~]# crm_mon -Arf1
> ============
> Last updated: Sat May 18 14:00:27 2013
> Last change: Thu May 16 17:33:07 2013 via crm_attribute on howard3.rjmetrics.com
> Stack: openais
> Current DC: howard3.rjmetrics.com - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> ============
>
> Online: [ howard3.rjmetrics.com howard2.rjmetrics.com ]
>
> Full list of resources:
>
>  Master/Slave Set: ms_drbd_mysql [p_drbd_mysql]
>      Masters: [ howard2.rjmetrics.com ]
>      Slaves: [ howard3.rjmetrics.com ]
>  Resource Group: g_mysql
>      p_fs_mysql (ocf::heartbeat:Filesystem): Started howard2.rjmetrics.com
>      ClusterPrivateIP (ocf::heartbeat:IPaddr2): Started howard2.rjmetrics.com
>      ClusterPublicIP (ocf::heartbeat:IPaddr2): Started howard2.rjmetrics.com
>      p_mysql (ocf::heartbeat:mysql): Started howard2.rjmetrics.com
>
> Node Attributes:
> * Node howard3.rjmetrics.com:
>     + master-p_drbd_mysql:0 : 1000
> * Node howard2.rjmetrics.com:
>     + master-p_drbd_mysql:1 : 10000
>
> Migration summary:
> * Node howard3.rjmetrics.com:
>    p_drbd_mysql:1: migration-threshold=1000000 fail-count=1
> * Node howard2.rjmetrics.com:
>    ClusterPublicIP: migration-threshold=1000000 fail-count=1
>
> Failed actions:
>     p_drbd_mysql:1_promote_0 (node=howard3.rjmetrics.com, call=29, rc=-2, status=Timed Out): unknown exec error
>     ClusterPublicIP_monitor_30000 (node=howard2.rjmetrics.com, call=122, rc=7, status=complete): not running
>
>
> howard2 and howard3 are the two clustered servers.
>
> During testing, when I ifdown either eth0 or eth1, the cluster starts
> the vip back up, but the other non-vip IPs and routes do not get
> started. I'm running Debian, so these are configured
> in /etc/network/interfaces. Saying 'manually' configured was misleading
> on my part, sorry about that.
Hmm, I cannot reproduce this right now, but I was pretty sure that
IPaddr2 uses "ip addr add X.X.X.X/YY dev ZZ", so I expected that running
ifdown on device ZZ would prevent Pacemaker from bringing the VIP back
up, as the underlying device no longer exists.
This is even supported by the fact that the non-VIP address doesn't come
up again: IPaddr2 doesn't run ifup, it adds an alias to an existing
device.
See "sudo crm ra meta IPaddr2" and search for "nic=".
Anyway, "ifdown" is not a valid way to test your cluster; it doesn't
represent any realistic production scenario.
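To illustrate the point about "nic=", here is a minimal sketch of a VIP
primitive pinned to a device. The resource name and the 30s monitor
interval are taken from your crm_mon output; the address, netmask and
device are placeholders I made up, not values from your configuration:

```
# Assumed values: ip, cidr_netmask and nic are illustrative placeholders.
# IPaddr2 only adds this address to the named device; it does not bring
# the device itself up or restore the device's own address and routes.
primitive ClusterPublicIP ocf:heartbeat:IPaddr2 \
        params ip="192.0.2.10" cidr_netmask="24" nic="eth0" \
        op monitor interval="30s"
```

The permanent address and routes of the device stay under the control of
/etc/network/interfaces, not of the cluster.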
>
> eth0 is the public interface, and eth1 is the private interface. eth2
> and eth3 are bonded as bond0, use jumbo frames, and are crossover cabled
> between the nodes.
>
> The test I was doing was to pull cables from eth0 and eth1, which hung
> the cluster. My assumption is that I need to add more configuration
> elements to manage the other IPs and also setup some ping hosts that
> when unreachable will initiate failover. What would help me I think is
> an example config or pointers to how to add these elements.
Well, without digging much into your configuration: yes, you need ping
nodes so that the best-connected node "wins", and you also need fencing,
which is mandatory on any cluster.
Here's a sample configuration for ping nodes and a location constraint
so that the best-connected node hosts the resource "foo":
primitive ping-gw-sw1-sw2 ocf:pacemaker:ping \
        params host_list="192.168.10.1 192.168.2.11 192.168.2.12" \
        dampen="35s" attempts="2" timeout="2" multiplier="100" \
        op monitor interval="15s"
clone ping-nq-sw-swsec-clone ping-gw-sw1-sw2 \
        meta target-role="Started"
location IPHA-on-connected-node foo \
        rule $id="IPHA-on-connected-node-rule" pingd: defined pingd
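As a rough guide to how the scoring works: the ping resource sets the
"pingd" node attribute to the multiplier times the number of reachable
hosts, so with the three hosts above it ranges from 0 to 300, and the
rule uses that value as the location score. If you would rather forbid a
resource from running on a node that has lost all connectivity, a
commonly used variant looks like this (the constraint id is made up, and
"foo" is still a placeholder for your resource or group):

```
# Hypothetical constraint id; "foo" stands in for your resource or group.
# Bans foo from any node where pingd is unset or zero.
location foo-needs-connectivity foo \
        rule -inf: not_defined pingd or pingd lte 0
```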
See
http://www.hastexo.com/resources/hints-and-kinks/network-connectivity-check-pacemaker
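As for fencing, I don't know your hardware, so take this only as a
sketch. It assumes both nodes have IPMI-capable management boards and
that the external/ipmi stonith plugin is installed; the resource and
constraint names, addresses and credentials are all placeholders:

```
# Assumption: IPMI boards reachable at the (made-up) addresses below.
# ipaddr, userid and passwd are placeholders, not real values.
primitive fence-howard2 stonith:external/ipmi \
        params hostname="howard2.rjmetrics.com" ipaddr="192.0.2.102" \
        userid="admin" passwd="secret" interface="lan" \
        op monitor interval="60s"
primitive fence-howard3 stonith:external/ipmi \
        params hostname="howard3.rjmetrics.com" ipaddr="192.0.2.103" \
        userid="admin" passwd="secret" interface="lan" \
        op monitor interval="60s"
# A node should not be responsible for fencing itself.
location fence-howard2-not-on-self fence-howard2 -inf: howard2.rjmetrics.com
location fence-howard3-not-on-self fence-howard3 -inf: howard3.rjmetrics.com
property stonith-enabled="true"
```

Whatever fencing device you end up using, the idea is the same: one
stonith resource per node to be fenced, each kept off the node it
targets.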
>
> On another note, the test made the drbd link disconnect, with both disks
> now marked as standalone in the lcmc gui. Right-clicking the disks or
> the connection does not allow any action other than view logs, which
> say:
>
> May 16 17:33:08 howard3 kernel: [781360.146362] block drbd0: Split-Brain detected but unresolved, dropping connection!
> May 16 17:33:08 howard3 kernel: [781360.146451] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
> May 16 17:33:08 howard3 kernel: [781360.149042] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> May 16 17:33:08 howard3 kernel: [781360.149051] block drbd0: conn( WFReportParams -> Disconnecting )
> May 16 17:33:08 howard3 kernel: [781360.149060] block drbd0: error receiving ReportState, l: 4!
> May 16 17:33:08 howard3 kernel: [781360.149154] block drbd0: asender terminated
> May 16 17:33:08 howard3 kernel: [781360.149159] block drbd0: Terminating drbd0_asender
> May 16 17:33:08 howard3 kernel: [781360.149609] block drbd0: Connection closed
> May 16 17:33:08 howard3 kernel: [781360.149619] block drbd0: conn( Disconnecting -> StandAlone )
> May 16 17:33:08 howard3 kernel: [781360.149811] block drbd0: receiver terminated
> May 16 17:33:08 howard3 kernel: [781360.149815] block drbd0: Terminating drbd0_receiver
>
> I'm really not sure how to proceed. Please let me know any additional
> information you may need.
I know nothing about shared storage.
>
> Thanks for your time Florian, it's much appreciated.
>
You're welcome.
--
Cheers,
Florian Crouzat