[ClusterLabs] Pacemaker issue when ethernet interface is pulled down

Sun Feb 14 14:01:48 UTC 2016

Configure fencing first. Once fencing is in place, use iptables to test
your cluster: with iptables you can block ports 5404 and 5405.
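
A minimal sketch of what that could look like, under some assumptions: the
cluster traffic is UDP on exactly those two ports, and dropping it in both
the INPUT and OUTPUT chains is enough to isolate the node. The function name
corosync_rules is just an illustrative helper. The script only prints the
commands (dry run), so nothing changes until you decide to run them.

```shell
#!/bin/sh
# Sketch: simulate a network failure for the cluster by dropping its
# traffic with iptables, rather than taking the interface down.
# Assumption: cluster traffic is UDP on the two ports named above.

COROSYNC_PORTS="5404 5405"

# Print the iptables commands for the given action:
#   -A appends the DROP rules (start of the test)
#   -D deletes the same rules (end of the test)
corosync_rules() {
    action="$1"
    for port in $COROSYNC_PORTS; do
        echo "iptables $action INPUT -p udp --dport $port -j DROP"
        echo "iptables $action OUTPUT -p udp --dport $port -j DROP"
    done
}

# Dry run: show the rules that would isolate this node.
corosync_rules -A
```

To actually apply the rules, run the printed commands as root on the node
you want to isolate (for example by piping the output to sh), and use
"corosync_rules -D" the same way to remove them afterwards.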

2016-02-14 14:09 GMT+01:00 Debabrata Pani <Debabrata.Pani at mobileum.com>:
> Hi,
> We ran into some problems when we pull down the ethernet interface using
> “ifconfig eth0 down”
>
> Our cluster has the following configuration and resources:
>
> - Two network interfaces: eth0 and lo (loopback)
> - 3 nodes, with one node put in maintenance mode
> - no-quorum-policy=stop
> - stonith-enabled=false
> - Postgresql Master/Slave
> - vip master and vip replication IPs
> - VIPs run on the node where the Postgresql Master is running
>
>
> The two test cases that we executed are as follows:
>
> 1. Introduce delay on the ethernet interface of the postgresql PRIMARY node
>    (command: tc qdisc add dev eth0 root netem delay 8000ms)
> 2. `ifconfig eth0 down` on the postgresql PRIMARY node
>
> We expected both of these test cases to test for network problems in the
> cluster.
>
>
> In the first case (ethernet interface delay):
>
> - The cluster is divided into a “partition WITH quorum” and a “partition
>   WITHOUT quorum”
> - The partition WITHOUT quorum shuts down all the services
> - The partition WITH quorum takes over as Postgresql PRIMARY and takes
>   the VIPs
> - Everything as expected. Wow!
>
>
> In the second case (ethernet interface down):
>
> We see lots of errors like the following on the node:
>
> Feb 12 14:09:48 corosync [MAIN  ] Totem is unable to form a cluster because
> of an operating system or network fault. The most common cause of this
> message is that the local firewall is configured improperly.
> Feb 12 14:09:49 corosync [MAIN  ] Totem is unable to form a cluster because
> of an operating system or network fault. The most common cause of this
> message is that the local firewall is configured improperly.
> Feb 12 14:09:51 corosync [MAIN  ] Totem is unable to form a cluster because
> of an operating system or network fault. The most common cause of this
> message is that the local firewall is configured improperly.
>
> But `crm_mon -Afr` (run on the node whose eth0 is down) always shows the
> cluster to be fully formed.
>
> - It shows all the nodes as UP
> - It shows itself as the one running the postgresql PRIMARY (as was the
>   case before the ethernet interface was taken down)
>
> `crm_mon -Afr` on the OTHER nodes tells a different story:
>
> - They show the other node as down
> - One of the other two nodes takes over as the postgresql PRIMARY
>
> This leads to a split-brain situation, which was gracefully avoided in the
> test case where only a delay was introduced on the interface.
>
>
> Questions:
>
> 1. Is it a known issue with pacemaker when the ethernet interface is pulled
>    down?
> 2. Is it an incorrect way of testing the cluster? There is some information
>    regarding the same in this thread:
>    http://www.gossamer-threads.com/lists/linuxha/pacemaker/59738
>
>
> Regards,
> Deba
>
>


-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^