[Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

Fri May 10 00:44:37 EDT 2013

On 10/05/2013, at 1:44 PM, pavan tc <pavan.tc at gmail.com> wrote:

> 
> 
> 
> On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> On 08/05/2013, at 9:16 PM, pavan tc <pavan.tc at gmail.com> wrote:
> 
> 
> Hi Andrew,
> 
> Thanks much for looking into this. I have some queries inline.
>  
> > Hi,
> >
> > I have a two-node cluster with STONITH disabled.
> 
> That's not a good idea.
> 
> Ok. I'll try and configure stonith.
> 
> > I am still running with the pcmk plugin as opposed to the recommended CMAN plugin.
> 
> On rhel6?
> 
> Yes.
>  
> 
> >
> > With 1.1.8, I see some messages (appended to this mail) once in a while. I do not understand some keywords here - There is a "Leave" action. I am not sure what that is.
> 
> It means the cluster is not going to change the state of the resource.
> 
> Why did the cluster execute the "Leave" action at this point?

There is no "Leave" action being executed.  We are simply logging that nothing is going to happen to that resource - it is in the state that we expect/want.

> Is there some other error that triggers this? Or is it a benign message?
> 
> 
> > And, there is a CIB update failure that leads to a RECOVER action. There is a message that says the RECOVER action is not supported. Finally this leads to a stop and start of my resource.
> 
> Well, and also Pacemaker's crmd process.
> My guess... the node is overloaded which is causing the cib queries to time out.
> 
> 
> Is there a cib query timeout value that I can set?

No.  You can set the batch-limit property though; this reduces the rate at which CIB operations are attempted.
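
As a rough sketch, assuming the crm shell is in use (you mention "crm configure show"), setting it could look like the following. The value 10 is only an illustration, not a recommendation; pick something that suits the load on your nodes.

```shell
# Assumption: crmsh is installed and this runs on a live cluster node.
# batch-limit is a cluster-wide property; 10 is an illustrative value only.
crm configure property batch-limit=10

# Confirm that the property now appears in the configuration.
crm configure show | grep batch-limit
```

A lower value makes the cluster do less at once, so large transitions take longer to complete.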

> I was earlier getting the TOTEM timeout.
> So, I set the token to a larger value (5 seconds) in corosync.conf and things were much better.
> But now, I have started hitting this problem.
> 
> Thanks,
> Pavan
> 
> > I can copy the "crm configure show" output, but nothing special there.
> >
> > Thanks much.
> > Pavan
> >
> > PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The underlying device that represents this resource has been removed. However, the resource is still part of the CIB. All errors related to that resource can be ignored. But can this cause a node to be stopped/fenced?
> 
> Not if fencing is disabled.
> 
> 