[Pacemaker] node offline after fencing (pacemakerd hangs)

Thu Jul 19 14:05:15 UTC 2012

----- Original Message -----
> From: "Raoul Bhatia [IPAX]" <r.bhatia at ipax.at>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Wednesday, July 18, 2012 10:12:14 AM
> Subject: Re: [Pacemaker] node offline after fencing (pacemakerd hangs)
> 
> On 2012-07-18 15:57, Ulrich Leodolter wrote:
> > hi,
> > 
> > after adding a second ring to corosync.conf
> > the problem seems to be gone.
> > 
> > after killing corosync the node is fenced by
> > the other node.  after reboot the cluster is
> > fully operational.
> > 
> > is this essential to have at least 2 rings?
> > 
> > maybe there is a network timing problem (but can't see
> > error messages)
> > the interface on ring 0 (192.168.20.171) is a bridge.
> > the interface on ring 1 (10.10.10.171) is normal ethernet
> > interface.
> 
> I've seen such things with bonding devices under debian 6.0
> 
> try something like:
> auto bond0
> iface bond0 inet static
> ...
>         bond-mode active-backup
>         bond-miimon 100
>         bridge_fd 0
>         bridge_maxwait 0
> 
> Another workaround is a "sleep 10" or similar at the beginning
> of the pacemaker script to let bond0 come up.
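
A fixed sleep works, but it either waits too long or not long enough. As a rough sketch only (not something from the init scripts themselves), the same idea can be written as a poll on the interface's carrier flag in sysfs. The function name wait_for_iface, the 30 second default and the SYSFS_NET override are all my own assumptions, and bond0 is just the name used above:

```shell
#!/bin/sh
# Hypothetical helper: wait until an interface reports carrier, or give up.
# Usage: wait_for_iface IFACE [TIMEOUT_SECONDS]
# SYSFS_NET is an assumed override, only there to make the function easy to try out.
wait_for_iface() {
    iface=$1
    timeout=${2:-30}
    sysfs=${SYSFS_NET:-/sys/class/net}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # carrier reads "1" once the link is up; the read fails
        # while the interface is down or does not exist yet.
        if [ "$(cat "$sysfs/$iface/carrier" 2>/dev/null)" = "1" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

Something like 'wait_for_iface bond0 30 || exit 1' near the top of the init script would then replace the plain sleep.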

Same here under Ubuntu. In my case it was OCFS2/dlm under Pacemaker autostarting on boot, but it's the same sort of problem.

Another solution is something like this (it will vary a little on RHEL, I believe):

Disable corosync autostart:
$ sudo update-rc.d -f corosync disable S

Then add 'post-up /etc/init.d/corosync start' to the bonding (or in your case bridged) interface in
/etc/network/interfaces.

^^^^ From:
http://www.gossamer-threads.com/lists/engine?do=post_view_flat;post=63617;page=1;sb=post_latest_reply;so=ASC;mh=25;list=linuxha

That way corosync won't start until the interfaces/bridge are actually up.
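
For the bridged ring 0 interface in this thread, the stanza might look roughly like the following. This is only a sketch: the bridge name br0, the port eth0 and the netmask are assumptions on my part, and only the address and the post-up line come from the discussion above.

```
# /etc/network/interfaces (fragment)
# br0, eth0 and the netmask are assumed names/values; adjust to your setup.
auto br0
iface br0 inet static
        address 192.168.20.171
        netmask 255.255.255.0
        bridge_ports eth0
        bridge_fd 0
        bridge_maxwait 0
        # start corosync only once this interface has been brought up
        post-up /etc/init.d/corosync start
```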

> 
> We always go with 2 rings, even when using a NIC bonding.

+1
We use 2 rings, each on a different bond.
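
For reference, a two-ring totem section in corosync.conf looks roughly like this. Treat it as a sketch: the networks are guessed from the two addresses mentioned earlier in the thread (assuming /24 netmasks), and the multicast addresses, ports and rrp_mode are placeholder assumptions, not anyone's actual config.

```
# corosync.conf (fragment), corosync 1.x style
totem {
        version: 2
        # redundant ring protocol; "active" is the other common choice
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.20.0   # assumed /24 network of ring 0
                mcastaddr: 226.94.1.1       # placeholder
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.10.10.0     # assumed /24 network of ring 1
                mcastaddr: 226.94.1.2       # placeholder
                mcastport: 5407
        }
}
```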

HTH
Jake