[Pacemaker] Pacemaker resource migration behaviour

Thu Feb 7 10:11:19 UTC 2013

Hi David,

I think you might be looking at the wrong part of the logs. I assume the line you meant was the following:

<1c>Feb  6 09:52:54 mu attrd[6256]:  warning: attrd_cib_callback: Update fail-count-sub-squid=(null) failed: No such device or address

Despite this failure, the recovery worked correctly and the resources were started then (as can be seen when examining the pe-input files 50-54).
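
For anyone who wants to pull warnings like this out of a longer syslog, here is a minimal Python sketch. The regular expression is an assumption based only on the single line quoted above, not on the attrd source, and parse_attrd_failure is a hypothetical helper name.

```python
import re

# Pattern derived from the one attrd warning quoted above (an assumption,
# not taken from the attrd source). An optional syslog priority prefix such
# as "<1c>" is simply skipped because search() starts at the month name.
ATTRD_FAIL = re.compile(
    r"(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+attrd\[(?P<pid>\d+)\]:\s+warning:\s+attrd_cib_callback:\s+"
    r"Update\s+(?P<attr>\S+)=(?P<value>\S+)\s+failed:\s+(?P<reason>.+)$"
)


def parse_attrd_failure(line):
    """Return a dict describing a failed attrd update, or None if no match."""
    m = ATTRD_FAIL.search(line)
    return m.groupdict() if m else None


line = ("<1c>Feb  6 09:52:54 mu attrd[6256]:  warning: attrd_cib_callback: "
        "Update fail-count-sub-squid=(null) failed: No such device or address")
info = parse_attrd_failure(line)
print(info["time"], info["host"], info["attr"], info["reason"])
```

Feeding each syslog line through parse_attrd_failure and keeping the non-None results should give a quick list of which attributes failed to update and when.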

What I meant was the entire portion of the logs between 09:37:52 and 09:37:53 (pe-input files 46-49). There, the state in which the CRM returns to idle is not the state that transitions 106-109 ought to have produced. I don't yet understand well enough how the CRM decides whether or not to perform an action. I also can't seem to get any debug logs from pacemaker that might help explain why the CRM/LRM does what it does.
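
To make the comparison reproducible, the pe-input files can be replayed one at a time with crm_simulate. The following is a rough Python sketch of how that could be scripted. The file names (pe-input-46.bz2 and so on) and the helper names are assumptions for illustration; only the --simulate and --xml-file options of crm_simulate are relied upon.

```python
import shutil
import subprocess


def simulate_cmd(pe_input):
    """Build a crm_simulate invocation that replays one saved pengine input."""
    return ["crm_simulate", "--simulate", "--xml-file", pe_input]


def replay(pe_inputs):
    """Run crm_simulate on each file and return {file: stdout}.

    Returns an empty dict if crm_simulate is not on PATH.
    """
    if shutil.which("crm_simulate") is None:
        return {}
    results = {}
    for path in pe_inputs:
        proc = subprocess.run(simulate_cmd(path), capture_output=True, text=True)
        results[path] = proc.stdout
    return results


# Hypothetical file names; adjust to wherever the pe-input files live.
files = ["pe-input-%d.bz2" % n for n in range(46, 50)]
for f in files:
    print(" ".join(simulate_cmd(f)))
```

The output of replay(files) for pe-input-49 can then be compared by hand with the resource state that the cluster actually reports once the CRM is idle.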

Regards,
James

On Feb 6, 2013, at 8:14 PM, David Vossel <dvossel at redhat.com> wrote:

> 
> 
> ----- Original Message -----
>> From: "James Guthrie" <jag at open.ch>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Wednesday, February 6, 2013 6:52:07 AM
>> Subject: Re: [Pacemaker] Pacemaker resource migration behaviour
>> 
>> A quick addendum to this message:
>> 
>> The log files I provided actually continue until the resources do get
>> started on the host. The trigger for that is the 6-minute
>> failure-timeout timer that pops. As can be seen in pe-input-50, the
>> resources conntrackd, condition, sub-ospfd and sub-ripd are in slave
>> on both hosts and sub-squid is not started on either. This shows
>> that the desired end-state of the transitions produced with
>> pe-input-49 is never reached.
>> 
> 
> Yep, this looks like a bug in attrd.  I see the command going out to delete the fail-count for squid, but it fails. Since the fail-count isn't properly expired, the sub-squid resource can't start.
> 
> Can you open a bugs.clusterlabs.org issue for this, please? Include the logs.
> 
> Thanks,
> -- Vossel
> 
>> James
>> 
>> On Feb 6, 2013, at 1:41 PM, James Guthrie <jag at open.ch> wrote:
>> 
>>> Hi David,
>>> 
>>> Unfortunately crm_report doesn't work correctly on my hosts as we
>>> have compiled from source with custom paths and apparently the
>>> crm_report and associated tools are not built to use the paths
>>> that can be customised with autoconf.
>>> 
>>> Despite that, I have done some investigation and think I may have
>>> found an inconsistency. I have attached the pacemaker-relevant
>>> syslog, including the pe-input files. The logfile starts where
>>> pacemaker detects that sub-squid is not running on mu. It then
>>> fails over to nu, where two further failures take place. In order
>>> to recover from these failures, the pengine produces transitions
>>> 106, 107, 108 and 109, with the corresponding pe-input files 46,
>>> 47, 48 and 49.
>>> 
>>> The way I understand it, pacemaker works through the transitions
>>> until something happens from outside, at which point the
>>> transitions are recalculated and pacemaker continues on.
>>> 
>>> Using crm_simulate to observe the transitions that should happen
>>> tells me that the transitions that were calculated from
>>> pe-input-49 ought to have resulted in the resources conntrackd,
>>> condition, sub-ospfd, sub-ripd and sub-squid being promoted to
>>> master. In fact, this never happens, but the crmd reports the
>>> transition as being complete. It appears as though nowhere is it
>>> acknowledged that the current state is not the desired outcome as
>>> calculated by the pengine. Is it possible that this is a bug?
>>> 
>>> Regards,
>>> James
>>> 
>>> <pacemaker-not-starting-resources.tar.gz>
>>> On Feb 5, 2013, at 7:41 PM, David Vossel <dvossel at redhat.com>
>>> wrote:
>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>>> From: "James Guthrie" <jag at open.ch>
>>>>> To: "The Pacemaker cluster resource manager"
>>>>> <pacemaker at oss.clusterlabs.org>
>>>>> Sent: Tuesday, February 5, 2013 8:12:57 AM
>>>>> Subject: Re: [Pacemaker] Pacemaker resource migration behaviour
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> as a follow-up to this, I realised that I needed to slightly change
>>>>> the way the resource constraints are put together, but I'm still
>>>>> seeing the same behaviour.
>>>>> 
>>>>> Below is an excerpt from the logs on the host and the revised XML
>>>>> configuration. In this case, I caused two failures on the host mu,
>>>>> which forced the resources onto nu, then I forced two failures on nu.
>>>>> What can be seen in the logs are the two detected failures on nu
>>>>> (the "warning: update_failcount:" lines). After the two failures on
>>>>> nu, the VIP is migrated back to mu, but none of the "support"
>>>>> resources are promoted with it.
>>>> 
>>>> I can't tell much from this output.
>>>> 
>>>> Run the steps you use to reproduce this and create a crm_report of
>>>> the issue so we can see both the logs and pengine transition
>>>> files that precede this.
>>>> 
>>>> -- Vossel
>>>> 
>>>> 
>>>>> Regards,
>>>>> James
>>>>> 
>>>> 
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started:
>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org