[Pacemaker] corosync vs. pacemaker 1.1

Fri Feb 10 15:46:03 UTC 2012

Hi,

On 01/30/2012 04:00 AM, Andrew Beekhof wrote:
> On Thu, Jan 26, 2012 at 2:08 AM, Kiss Bence <bence at noc.elte.hu> wrote:
>> Hi,
>>
>> I am a newbie to clustering and I am trying to build a two-node
>> active/passive cluster based on the documentation:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>
>> My systems are Fedora 14, up to date. After forming the cluster as
>> described, I started to test it (resources: drbd -> lvm -> fs -> group of
>> services). I moved resources around and rebooted and killed nodes (first in
>> a virtual environment, then also on real machines).
>>
>> After some events the two nodes ended up in a kind of split-brain state.
>> On both nodes crm_mon showed the other node as offline, although the drbd
>> subsystem showed everything in sync and working. The network was not the
>> issue (ping, TCP and UDP communication were fine). Nothing had changed from
>> the network's point of view.
>>
>> At first the rejoining went quite well, but after some more events it took
>> longer, and after still more events it did not happen at all. The network
>> dump showed the multicast packets still coming and going. At the corosync
>> level (crm_node -l) the other node did not appear on either of them. After
>> I tried configuring the cib, the logs were full of messages like
>> "<the other node>: not in our membership".
>
> That looks like a pacemaker bug.
> Can you use crm_report to grab logs from about 30 minutes prior to the
> first time you see this log until an hour after please?
>
> Attach that to a bug in bugs.clusterlabs.org and I'll take a look.

I have created a bug report: id 5031.
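
For anyone collecting the same data, the window asked for above (about 30 minutes before the first message until an hour after) can be turned into a crm_report call roughly as follows. This is only a sketch: it assumes GNU date and the -f/-t options shown by crm_report --help, and the function name report_window and the default report name are made up for illustration.

```shell
#!/bin/sh
# Print a crm_report command covering 30 minutes before to 60 minutes after
# the first occurrence of the log message. Requires GNU date (-d option).
# Usage: report_window "YYYY-MM-DD HH:MM:SS" [report-name]
report_window() {
    first_seen=$1
    dest=${2:-membership-issue}        # hypothetical report name
    # Convert to epoch seconds so the offsets are plain arithmetic.
    epoch=$(date -d "$first_seen" +%s) || return 1
    from=$(date -d "@$((epoch - 30 * 60))" '+%Y-%m-%d %H:%M:%S')
    to=$(date -d "@$((epoch + 60 * 60))" '+%Y-%m-%d %H:%M:%S')
    # Printed rather than executed, so the window can be checked first.
    echo "crm_report -f \"$from\" -t \"$to\" $dest"
}
```

Calling report_window with the timestamp of the first "not in our membership" line prints the command to run; the timestamp has to be taken from the node's own logs.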

The "split-brain" lasts about 5 minutes every time. During that time each
node thinks that the other is dead. However, drbd is working fine and
properly prevents the second, rebooted node from going Primary.
crm_node -l shows only the local node.
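
A rough way to capture what each layer believes during those 5 minutes is to run the usual status commands on both nodes and compare the output. The sketch below assumes the stock tools (crm_node, corosync-cfgtool, crm_mon) are on the PATH; the function name membership_snapshot is made up for illustration, and a missing tool is only noted, not treated as an error.

```shell
#!/bin/sh
# Print what Pacemaker and corosync report about membership on this node.
membership_snapshot() {
    for cmd in "crm_node -l" "crm_node -p" "corosync-cfgtool -s" "crm_mon -1"; do
        echo "== $cmd =="
        tool=${cmd%% *}                # first word of the command line
        if command -v "$tool" >/dev/null 2>&1; then
            # Unquoted on purpose: split into the command and its option.
            $cmd 2>&1 || echo "(exit status $?)"
        else
            echo "($tool not found on this host)"
        fi
    done
}
```

Running membership_snapshot on both nodes at the same moment and comparing the two outputs shows whether the nodes disagree about membership or only about resource state.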

Meanwhile, one of my questions has been answered: the multicast issue was a
local network issue. The local netadmin fixed it, and now it works.

This issue seems similar to what James Flatten reported on 8 Feb.
([Pacemaker] Question about cluster start-up in a 2 node cluster with a
node offline.)

The cluster has stonith-enabled="false" and no-quorum-policy="ignore" set.

Thanks in advance,
Bence

>
>>
>> I tried to erase the config (crm configure erase, cibadmin -E -f) but it
>> worked only locally. I noticed that the pacemaker process didn't start up
>> normally on the node that booted after the other. I also tried removing
>> files from /var/lib/pengine/ and /var/lib/heartbeat/crm/, but only the
>> resources were gone. It didn't help with forming a cluster without
>> resources. The pacemaker process exited some 20 minutes after it started.
>> Starting it manually gave the same result.
>>
>> After digging through Google for answers I found nothing helpful. Following
>> some tips, I changed the version in the /etc/corosync/service.d/pcmk file
>> to 1.1 (this is the version of pacemaker in this distro). I realized that
>> the cluster processes were then started by corosync itself, not by
>> pacemaker, which could be omitted. Cluster forming has been stable since
>> this change, even after many, many events.
>>
>> Now I have reread the document mentioned above, and I wonder why it has the
>> "Important notice" on page 37. What is theoretically wrong with my scenario?
>
> Having corosync start the daemons worked well for some but not others,
> thus it was unreliable.
> The notice points out a major difference between the two operating
> modes so that people will not be caught by surprise when pacemaker
> does not start.
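
To make the two modes concrete: as far as I understand Clusters from Scratch, the plugin stanza in /etc/corosync/service.d/pcmk has a ver field, where ver: 0 lets corosync launch the pacemaker daemons and ver: 1 expects pacemaker to be started separately (service pacemaker start). Below is a minimal sketch that reports which mode a file asks for. The function name pcmk_mode is made up for illustration, and anything other than 0 or 1 (for example 1.1) is only reported as undocumented, since the sketch does not try to guess how the plugin treats such a value.

```shell
#!/bin/sh
# Report which start-up mode a pcmk service file requests.
# Usage: pcmk_mode [file]   (default: /etc/corosync/service.d/pcmk)
pcmk_mode() {
    file=${1:-/etc/corosync/service.d/pcmk}
    # Value of the first "ver:" line; comment lines start with '#' and
    # therefore never match.
    ver=$(sed -n 's/^[[:space:]]*ver:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' \
        "$file" | head -n 1)
    case "$ver" in
        0)  echo "ver 0: corosync starts the pacemaker daemons" ;;
        1)  echo "ver 1: pacemaker must be started separately" ;;
        "") echo "no ver line found" ;;
        *)  echo "ver $ver: not a documented value" ;;
    esac
}
```

Run without arguments it reads the default path; passing another file name makes it easy to compare the two nodes' copies.
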
>
>> Why does it work? Why didn't the config suggested by the document work?
>>
>> Tests were done first on Fedora 14 virtual machines (per node: 1 CPU core,
>> 512 MB RAM, 10G disk, 1G drbd on a logical volume, and a physical volume on
>> drbd forming a volume group named cluster).
>>
>> Then on real machines. They have more CPU cores (4), more RAM (4G) and more
>> disk (mirrored 750G), 180G drbd, and a guaranteed 100M routed link between
>> the nodes, which are 5 hops apart.
>>
>> By the way, how should one configure corosync to work on a routed multicast
>> network? I had to create an openvpn tap link between the real nodes to make
>> it work. The original config with public IPs didn't work. Is corosync
>> equipped to cope with multicast PIM messages? Or was it a firewall
>> issue?
>>
>> Thanks in advance,
>> Bence
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org