[Pacemaker] Info on failcount automatic reset
Gianluca Cecchi
gianluca.cecchi at gmail.com
Wed Jun 25 10:57:17 CEST 2014
On Wed, Jun 25, 2014 at 1:28 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>
> > So it seems that at midnight the resource already had a failcount of 2
> > (perhaps caused by problems that happened weeks ago..?) and then at 03:38
> > it got a timeout on its monitor operation and was relocated...
> >
> > pacemaker is at 1.1.6-1.27.26
>
> I don't think the automatic reset was part of 1.1.6.
> The documentation you're referring to is probably SLES12 specific.
>
> > and I see this list message that seems related:
> > http://oss.clusterlabs.org/pipermail/pacemaker/2012-August/015076.html
> >
> > Is it perhaps only a matter of setting the meta parameter
> > failure-timeout
> > as explained in the High Availability Guide:
> > https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html#sec.ha.config.hawk.rsc
> >
> > in particular
> > 5.3.6. Specifying Resource Failover Nodes
> > ...
> > 4. If you want to automatically expire the failcount for a resource, add
> > the failure-timeout meta attribute to the resource as described in
> > Procedure 5.4: Adding Primitive Resources, Step 7 and enter a Value for
> > the failure-timeout.
> > ..
> > ?
>
Yes, you are right. It seems that, starting from here:
https://www.suse.com/it-it/documentation/sles11/
or here
https://www.suse.com/documentation/sles11/
the SLES 11 HTML links for the "SUSE Linux Enterprise High Availability
Extension Guide" erroneously point to the SLES 12 documentation anyway...
I tried the "feedback" button at the bottom, but it doesn't work (at least
with my Chrome browser on Fedora 20) for either the Italian page or the
English one...
Going through the PDF documents I had already downloaded, I still have this
for SLES 11 SP2, which is the release running on the system in question:
"
5.3.5 Specifying Resource Failover Nodes
...
A resource will be automatically restarted if it fails. If that cannot be
achieved on the
current node, or it fails N times on the current node, it will try to fail
over to another
node. You can define a number of failures for resources (a
migration-threshold),
after which they will migrate to a new node. If you have more than two
nodes in your
cluster, the node a particular resource fails over to is chosen by the High
Availability
software.
However, you can specify the node a resource will fail over to by
proceeding as follows:
1 Configure a location constraint for that resource as described in
Procedure 5.6,
“Adding or Modifying Locational Constraints” (page 86).
2 Add the migration-threshold meta attribute to that resource as described
in
Procedure 5.3, “Adding or Modifying Meta and Instance Attributes” (page 82)
and
enter a Value for the migration-threshold. The value should be positive and
less that
INFINITY.
3 If you want to automatically expire the failcount for a resource, add the
failure-timeout meta attribute to that resource as described in Procedure
5.3,
“Adding or Modifying Meta and Instance Attributes” (page 82) and enter a
Value
for the failure-timeout.
4 If you want to specify additional failover nodes with preferences for a
resource,
create additional location constraints.
"
So the question remains about the "failure-timeout" parameter and/or other
methods to solve/mitigate what I described in my first message.
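In the meantime, I suppose the stale failcount can be inspected and cleared
by hand from the crm shell; another sketch, with p_example and node1 as
placeholder resource/node names:

  # overview of all current fail counts
  crm_mon -1 --failcounts

  # query or reset the failcount of one resource on one node
  crm resource failcount p_example show node1
  crm resource failcount p_example delete node1

  # or clear the failcount together with the resource's operation history
  crm resource cleanup p_example node1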
Thanks,
Gianluca