[Pacemaker] corosync vs. pacemaker 1.1

Thu Jan 26 14:40:07 UTC 2012

Hi,

On 01/26/2012 02:13 PM, Dan Frincu wrote:
> Hi,
>
> On Wed, Jan 25, 2012 at 5:08 PM, Kiss Bence<bence at noc.elte.hu>  wrote:
>> Hi,
>>
>> I am a newbie to clustering and I am trying to build a two-node
>> active/passive cluster based on the documentation:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>
>> My systems are Fedora 14, up to date. After forming the cluster as
>> described, I started to test it. (resources: drbd -> lvm -> fs -> group of
>> services) Resources were moved around, nodes were rebooted and killed
>> (first I tried it in a virtual environment, then also on real machines).
>>
>> After some events the two nodes ended up in a kind of split-brain state.
>> crm_mon on each node showed the other node as offline, although the drbd
>> subsystem showed everything in sync and working. The network was not the
>> issue (ping, tcp and udp communications were fine). Nothing changed from
>> the network's point of view.
>>
>> At first the rejoining went quite well, but after some more events it took
>> longer, and after still more events it did not happen at all. The network
>> dump showed the multicast packets still coming and going. In corosync
>> (crm_node -l) the other node did not appear on either of them. After I
>> tried to configure the cib, the logs were full of messages like
>> "<the other node>: not in our membership".
>>
>> I tried to erase the config (crm configure erase, cibadmin -E -f) but it
>> worked only locally. I noticed that the pacemaker process didn't start up
>> normally on the node that booted after the other. I also tried to remove
>> files from /var/lib/pengine/ and /var/lib/heartbeat/crm/ but only the
>> resources were gone. It didn't help with forming a cluster without
>> resources. The pacemaker process exited some 20 minutes after it started.
>> Starting it manually gave the same result.
>>
>> After searching Google for answers I found nothing helpful. Following tips
>> I ran across, I changed the version in the /etc/corosync/service.d/pcmk
>> file to 1.1 (this is the version of pacemaker in this distro). I realized
>> that the cluster processes were then started by corosync itself, not by
>> pacemaker, which could be omitted. Cluster formation has been stable since
>> this change, even after many, many events.
>>
>> Now I have reread the document mentioned above, and I wonder why it has the
>> "Important notice" on page 37. What is theoretically wrong with my scenario?
>> Why does it work? Why didn't the config suggested by the document work?
>>
>> Tests were done first on virtual machines running Fedora 14 (per node: 1 CPU
>> core, 512 MB RAM, 10G disk, 1G drbd on a logical volume, physical volume on
>> drbd forming a volgroup named cluster).
>>
>> Then on real machines. They have more CPU cores (4), more RAM (4G) and more
>> disk (mirrored 750G), 180G drbd, and a 100M guaranteed routed link between
>> the nodes, 5 hops away.
>>
>> By the way, how should one configure corosync to work on a multicast routed
>> network? I had to create an openvpn tap link between the real nodes to get
>> it working. The original config with public IPs didn't work. Is corosync
>> equipped to cope with multicast PIM messages? Or was it a firewall
>> issue?
>
> First question, what versions of software are on each of the nodes?

Test bed nodes:

[root@virt1 ~]# corosync -v
Corosync Cluster Engine, version '1.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.
[root@virt1 ~]# pacemakerd -$
Pacemaker 1.1.6-1.fc14
Written by Andrew Beekhof

[root@virt2 ~]# corosync -v;pacemakerd -$
Corosync Cluster Engine, version '1.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.
Pacemaker 1.1.6-1.fc14
Written by Andrew Beekhof

Real nodes:

[root@ipa ~]# corosync -v;pacemakerd -$
Corosync Cluster Engine, version '1.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.
Pacemaker 1.1.6-1.fc14
Written by Andrew Beekhof

[root@eta ~]# corosync -v;pacemakerd -$
Corosync Cluster Engine, version '1.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.
Pacemaker 1.1.6-1.fc14
Written by Andrew Beekhof

>
> When using multicast, corosync doesn't care about "routing" the
> messages AFAIK, it relies on the network layer to do its job. Now the
> "split-brain" you mention can take place due to network interruption,
> or due to missing or untested fencing as well.

I created a testing environment for the cluster before going on to
manage real services with the cluster software. The testbed is two nodes
on the same machine under KVM virtualization, on the same network
mentioned above. There is no routing here. Everything is in L2.

>
> Second question, do you have fencing configured?

No, and with only one channel of communication (the net) I do not find it
helpful. I thought of a third quorum node somewhere else, outside the
two buildings. If the network goes down so badly that at least two of the
three nodes don't see each other, the services may as well be down, since
no one can use them.
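
A minimal sketch of the majority rule behind the third-node idea, assuming
the usual "strictly more than half" definition of quorum. The function name
has_quorum is hypothetical and is not part of any pacemaker or corosync tool.

```python
def has_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """True if a partition seeing visible_nodes (itself included) is a strict majority."""
    return visible_nodes > total_nodes // 2


# Two-node cluster after a split: each side sees only itself.
print("2 nodes, 1 visible:", has_quorum(1, 2))
# Three-node cluster: the side that still sees the quorum node keeps running.
print("3 nodes, 2 visible:", has_quorum(2, 3))
print("3 nodes, 1 visible:", has_quorum(1, 3))
```

With two nodes neither side of a split can hold a majority, which is why a
third vote changes the outcome.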

>
> You've mentioned 2(?) nodes "5 hops away", I'm guessing they're not in
> the same datacenter. If so, did you also test the latency on the
> network between endpoints? Also can you make sure PIM routing is
> enabled on all of the "hops" along the way?
>

The real servers (ipa, eta) are not even in the same building. It's a
university campus. I was told the network supports multicast routing,
although with mtest.tgz (a simple multicast test utility) I cannot confirm
that it is working well. This problem seems to be a local one. I will ask
the local netadmin for help.
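
For a cross-check independent of mtest, a multicast probe could be sketched
as below. It is only an illustration: the group address, port and TTL are
assumptions, not values from this cluster. The operating system's default
multicast TTL is 1, which keeps packets on the local subnet, so a routed path
needs a larger value. corosync.conf(5) documents a ttl option in the
interface section for the same reason; whether the 1.4.2 packages used here
honour it is not verified.

```python
import socket
import struct
import sys

GROUP = "239.255.42.1"  # assumption: an administratively scoped test group
PORT = 5500             # assumption: any free UDP port, not the corosync mcastport
TTL = 8                 # assumption: a few more than the 5 hops between the nodes


def make_sender(ttl: int = TTL) -> socket.socket:
    """UDP socket whose multicast packets may cross up to ttl - 1 routers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def make_receiver(group: str = GROUP, port: int = PORT) -> socket.socket:
    """UDP socket bound to port and joined to group on the default interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq: group address followed by the local interface (0.0.0.0 = default)
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


if __name__ == "__main__":
    if sys.argv[1:] == ["recv"]:
        receiver = make_receiver()
        while True:
            data, addr = receiver.recvfrom(1500)
            print(addr[0], data.decode(errors="replace"))
    elif sys.argv[1:] == ["send"]:
        make_sender().sendto(b"multicast probe", (GROUP, PORT))
```

Run it with the argument recv on one node and send on the other. If nothing
arrives even with a TTL above the hop count, the path itself (PIM routing or
a firewall) is the more likely place to look.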

The network latency:

With drbd up to date:
[root@ipa ~]# time ping eta -i .01 -q -s 1472 -c 2000
PING eta () 1472(1500) bytes of data.

--- eta ping statistics ---
2000 packets transmitted, 2000 received, 0% packet loss, time 17976ms
rtt min/avg/max/mdev = 0.483/0.492/1.350/0.031 ms

real	0m17.979s
user	0m2.004s
sys	0m14.857s

While drbd is syncing:
[root@ipa ~]# time ping eta -i .01 -q -s 1472 -c 2000
PING eta () 1472(1500) bytes of data.

--- eta ping statistics ---
2000 packets transmitted, 2000 received, 0% packet loss, time 17987ms
rtt min/avg/max/mdev = 0.482/6.217/9.572/1.885 ms

real	0m18.038s
user	0m0.652s
sys	0m4.708s
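
To put the two runs side by side, the summary lines can be parsed as sketched
below. The helper name parse_rtt is hypothetical, and the 1000 ms figure is
the default token timeout documented in corosync.conf(5), assumed here
because the actual totem settings are not shown in this thread.

```python
import re

RTT_RE = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_rtt(line: str) -> dict:
    """Extract min/avg/max/mdev in milliseconds from a ping summary line."""
    match = RTT_RE.search(line)
    if match is None:
        raise ValueError("not a ping rtt summary line")
    return dict(zip(("min", "avg", "max", "mdev"), map(float, match.groups())))


# Summary lines copied from the two ping runs above.
idle = parse_rtt("rtt min/avg/max/mdev = 0.483/0.492/1.350/0.031 ms")
syncing = parse_rtt("rtt min/avg/max/mdev = 0.482/6.217/9.572/1.885 ms")

TOKEN_TIMEOUT_MS = 1000.0  # assumption: corosync default, not taken from this setup

print(f"avg RTT grows {syncing['avg'] / idle['avg']:.1f}x while drbd is syncing")
print("worst RTT below token timeout:", syncing["max"] < TOKEN_TIMEOUT_MS)
```

Even the worst round trip during a sync stays well under that assumed
timeout, so on these numbers latency alone does not look like the cause of
the lost membership.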

> Your scenario seems to be a split-site, so you may be interested in
> https://github.com/jjzhang/booth as well.

Yes, it is. At least the real one. But what about the testbed, the
virtual machines on the same host? Aren't they supposed to work correctly
when following the document?

Thank you anyway! I will look into this daemon's development to see whether
I can use it.

Bence

>
> Regards,
> Dan
>
>>
>> Thanks in advance,
>> Bence
>>