[Pacemaker] cib: ERROR: send_ais_message: Not connected to AIS

Mon Apr 14 12:58:18 UTC 2014

On Mon, 14 Apr 2014 14:40:43 +1000
Andrew Beekhof <andrew at beekhof.net> wrote:

> 
> On 11 Apr 2014, at 10:54 pm, Marco Felettigh <marco at nucleus.it> wrote:
> 
> > On Fri, 11 Apr 2014 17:17:57 +1000
> > Andrew Beekhof <andrew at beekhof.net> wrote:
> > 
> >> 
> >> On 8 Apr 2014, at 8:37 pm, marco at nucleus.it wrote:
> >> 
> >>> On Tue, 8 Apr 2014 10:49:16 +1000
> >>> Andrew Beekhof <andrew at beekhof.net> wrote:
> >>> 
> >>>> 
> >>>> On 7 Apr 2014, at 8:46 pm, marco at nucleus.it wrote:
> >>>> 
> >>>>> Hi,
> >>>>> in a production environment with 2 nodes ( nodeA , nodeB ) we
> >>>>> had a hardware failure, so we restarted nodeB.
> >>>>> After nodeB came back up we restarted corosync/pacemaker
> >>>>> on it, but for 2 days now the corosync/pacemaker stuff has been
> >>>>> looping.
> >>>>> 
> >>>>> crm_mon NodeA:
> >>>>> 
> >>>>> Stack: openais
> >>>>> Current DC: nodeA - partition with quorum
> >>>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> >>>>> 2 Nodes configured, 2 expected votes
> >>>>> 17 Resources configured.
> >>>>> ============
> >>>>> 
> >>>>> Online: [ nodeA ]
> >>>>> OFFLINE: [ nodeB ]
> >>>>> 
> >>>>> 
> >>>>> crm_mon NodeB:
> >>>>> 
> >>>>> Stack: openais
> >>>>> Current DC: NONE
> >>>>> 2 Nodes configured, 2 expected votes
> >>>>> 17 Resources configured.
> >>>>> ============
> >>>>> 
> >>>>> OFFLINE: [ nodeA nodeB ]
> >>>>> 
> >>>>> This loop on nodeB reports:
> >>>>> crmd: [7149]: debug: do_election_count_vote: Election 3 (owner:
> >>>>> nodeA) lost: vote from nodeA (Age)
> >>>>> 
> >>>>> So, investigating further, I found this message on nodeA:
> >>>>> cib: [28755]: ERROR: send_ais_message: Not connected to AIS
> >>>>> 
> >>>>> Now this message is repeated for every operation.
> >>>>> Is it a corosync problem or a cib/pacemaker one?
> >>>>> Any suggestion on what happened?
> >>>> 
> >>>> For some reason the cib can't connect to corosync anymore.
> >>>> No software got upgraded recently?
> >>>> 
> >>>> Are there any logs from corosync?
> >>>> Which distro is this?
> >>>> 
> >>>>> And why did starting a cluster node crash the DC stuff? :(
> >>>>> 
> >>>>> 
> >>>>> Bye Marco
> >>>>> 
> >>>> 
> >>> 
> >>> Hi,
> >>> the distro is openSUSE 11.1 and there are no updates, also
> >>> because the distro is out of maintenance.
> >> 
> >> A good reason to be using SLES (or RHEL/CentOS).
> > 
> > Better Gentoo ;)
> > 
> >> 
> >>> We are planning an upgrade, but the interesting thing is to figure
> >>> out the cause of the problem.
> >>> The log is attached, thanks for the support
> >> 
> >> There's nothing obvious in the logs.  Just that as far as pacemaker
> >> could tell, corosync suddenly went away. Was the corosync process
> >> still running?
> >> 
> > 
> > Yes , corosync was still running .
> 
> Stopping pacemaker and restarting it didn't help?
> 

In the end we restarted the two servers and then started the
corosync/pacemaker stuff.

Thanks for the support
Marco
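
For anyone hitting the same symptom, a rough way to see how often each
daemon logged the error is to count the matching lines per process. This
is only a sketch: the pattern is taken from the single log line quoted in
this thread, and the function name count_ais_errors is made up for
illustration, not part of pacemaker or corosync.

```python
import re
from collections import Counter

# Pattern modelled on the one line quoted in this thread:
#   cib: [28755]: ERROR: send_ais_message: Not connected to AIS
# Other log formats may differ; adjust the pattern if so (assumption).
AIS_ERROR = re.compile(
    r"(?P<daemon>\w+): \[(?P<pid>\d+)\]: ERROR: "
    r"send_ais_message: Not connected to AIS"
)


def count_ais_errors(lines):
    """Count 'Not connected to AIS' errors per (daemon, pid) pair."""
    counts = Counter()
    for line in lines:
        match = AIS_ERROR.search(line)
        if match:
            key = (match.group("daemon"), int(match.group("pid")))
            counts[key] += 1
    return counts
```

It accepts any iterable of lines, so an open log file can be passed
directly. Where the log lives depends on the syslog setup of the node.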