[Pacemaker] Problems after updating from debian squeeze to wheezy
Arnold Krille
arnold at arnoldarts.de
Mon Jul 29 16:46:38 UTC 2013
Hi all,
I have a little problem here and would like to get some help:
I have (had?) a working three-node cluster of two active nodes (nebel1
and nebel2) and one standby-node (nebel3) running debian squeeze +
backports. That is pacemaker 1.1.7-1~bpo60+1 and corosync
1.4.2-1~bpo60+1.
Now I updated the standby-node nebel3 to debian wheezy, which went
through without problems itself. And as the upstream versions of
pacemaker and corosync didn't change, I expected the updated nebel3 to
join the original cluster again. Little did I know... While nebel3 now
has pacemaker 1.1.7-1 and corosync 1.4.2-3, something in the update
seems to have broken cluster membership.
/etc/corosync/corosync.conf is still the same on all nodes.
I suspect the problem is somewhere in corosync as nebel1 and nebel2
only see each other:
$ ssh root at nebel2 -- corosync-objctl |grep member
runtime.totem.pg.mrp.srp.members.33648138.ip=r(0) ip(10.110.1.2) r(1)
ip(10.112.0.2)
runtime.totem.pg.mrp.srp.members.33648138.join_count=1
runtime.totem.pg.mrp.srp.members.33648138.status=joined
runtime.totem.pg.mrp.srp.members.16870922.ip=r(0) ip(10.110.1.1) r(1)
ip(10.112.0.1)
runtime.totem.pg.mrp.srp.members.16870922.join_count=1
runtime.totem.pg.mrp.srp.members.16870922.status=joined
runtime.totem.pg.mrp.srp.members.50425354.ip=r(0) ip(10.110.1.3) r(1)
ip(10.112.0.3)
runtime.totem.pg.mrp.srp.members.50425354.join_count=39
runtime.totem.pg.mrp.srp.members.50425354.status=left
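For what it's worth, the member IDs above are just the ring0 IPv4
addresses read as little-endian 32-bit integers (corosync's default
auto-generated nodeid), so you can map each ID back to a node. A quick
sketch to confirm which ID is which:

```shell
# Derive the default corosync nodeid from a ring0 IPv4 address:
# the four octets assembled as a little-endian 32-bit integer.
ip_to_nodeid() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( d * 16777216 + c * 65536 + b * 256 + a ))
}

ip_to_nodeid 10.110.1.1   # nebel1 -> 16870922
ip_to_nodeid 10.110.1.2   # nebel2 -> 33648138
ip_to_nodeid 10.110.1.3   # nebel3 -> 50425354
```

So nebel2's view says nebel3 (50425354) joined and left 39 times, while
nebel3's own view only contains itself.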
nebel3 on the other hand:
$ ssh root at nebel3 -- corosync-objctl |grep member
runtime.totem.pg.mrp.srp.members.50425354.ip=r(0) ip(10.110.1.3) r(1)
ip(10.112.0.3)
runtime.totem.pg.mrp.srp.members.50425354.join_count=1
runtime.totem.pg.mrp.srp.members.50425354.status=joined
Both nebel2 and nebel3 think the communication-rings are free of
faults:
$ ssh root at nebel2 -- corosync-cfgtool -s
Printing ring status.
Local node ID 33648138
RING ID 0
id = 10.110.1.2
status = ring 0 active with no faults
RING ID 1
id = 10.112.0.2
status = ring 1 active with no faults
$ ssh root at nebel3 -- corosync-cfgtool -s
Printing ring status.
Local node ID 50425354
RING ID 0
id = 10.110.1.3
status = ring 0 active with no faults
RING ID 1
id = 10.112.0.3
status = ring 1 active with no faults
I can ping all the participating nodes via all their connections and
IPs from all nodes.
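One thing plain ping does not exercise is multicast, which corosync's
totem protocol uses by default. Assuming this cluster is on the default
multicast transport, running omping simultaneously on all three nodes
would show whether multicast actually flows between nebel3 and the
others (just a diagnostic sketch, not something I have run here):

```shell
# Run this at the same time on each of nebel1, nebel2 and nebel3.
# omping answers its peers over both unicast and multicast; a node
# that gets unicast replies but no multicast replies points at a
# multicast/IGMP-snooping problem rather than at pacemaker/corosync.
omping 10.110.1.1 10.110.1.2 10.110.1.3
```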
The corosync.log on nebel2 doesn't mention nebel3 after it left the
cluster for the post-update reboot. Likewise, the corosync.log on nebel3
doesn't mention nebel1 and nebel2 anymore.
So, what did I miss during the update? How can I get nebel3 to join
back into the original cluster instead of forming its own 1-out-of-3
cluster (with the same resources defined)?
Any help is highly appreciated!
- Arnold