[Pacemaker] migration-threshold question

Fri Apr 3 11:36:56 UTC 2009

On Sat, Mar 21, 2009 at 17:04, Juha Heinanen <jh at tutpro.com> wrote:
> i have a resource that used to have this crm definition:
>
> primitive test lsb:test \
>        op monitor interval="30s" timeout="5s" \
>        meta target-role="Started"
>
> if i stopped the resource by
>
> /etc/init.d/test stop
>
> pacemaker restarted it, as i was expecting.
>
> then i modified the "test" init script so that starting the resource
> always failed.  the result was that pacemaker kept on trying to restart
> it forever without migrating the group of primitives, of which "test" is
> the last member, to the other node.
>
> i searched the archives and found this about the migration-threshold parameter:
>
>  If you used pacemaker 1.0 you would not have to deal with
>  failure-stickiness anymore, but could use the very nice new
>  "migration-threshold" feature. Set this to 1 and after 1 failure, the
>  resource will failover, regardless of its score.
>
> so i went and set migration-threshold to value 3 hoping that after three
> failed attempts to restart the resource the group would migrate to the
> other node:

nope - start failures are considered special.
by default each one counts as INFINITY failures (that is, a single failed
start should immediately cause the node to no longer be allowed to run
the resource), so migration-threshold="3" is exceeded on the first one.

if you want each failed start to count as only 1 failure, you need to run:
   crm_attribute -n start-failure-is-fatal -v false
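
to make the counting rule concrete, here is a rough python sketch of it.
it is a simplified model for illustration only, not pacemaker's actual
code: the function names and the list-of-operation-names input are made
up, and it assumes a node is ruled out once the fail count reaches
migration-threshold, ignoring everything else that feeds into placement.

```python
import math

# Stand-in for pacemaker's INFINITY score (assumption: modelled as float inf).
INFINITY = math.inf


def failcount_after(failed_ops, start_failure_is_fatal=True):
    """Return the effective fail count after a sequence of failed operations.

    failed_ops is a list of operation names such as ["monitor", "start"].
    A failed start counts as INFINITY unless start_failure_is_fatal is False,
    in which case it counts as 1 like any other failure.
    """
    count = 0
    for op in failed_ops:
        if op == "start" and start_failure_is_fatal:
            return INFINITY
        count += 1
    return count


def node_ruled_out(failed_ops, migration_threshold, start_failure_is_fatal=True):
    """True once the fail count has reached migration_threshold."""
    return failcount_after(failed_ops, start_failure_is_fatal) >= migration_threshold


# With the default, one failed start already exceeds a threshold of 3.
print(node_ruled_out(["start"], 3))
# With start-failure-is-fatal=false it takes three failed starts.
print(node_ruled_out(["start"], 3, start_failure_is_fatal=False))
print(node_ruled_out(["start"] * 3, 3, start_failure_is_fatal=False))
```

in this model, turning start-failure-is-fatal off is what lets
migration-threshold="3" behave as "three failed attempts, then move".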

> primitive test lsb:test \
>        op monitor interval="30s" timeout="5s" \
>        meta target-role="Started" migration-threshold="3"
>
> the result, however, was that after 3 restart attempts, the resource
> has stayed "Stopped" on the node where it failed:
>
> ============
> Last updated: Sat Mar 21 19:02:07 2009
> Current DC: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325)
> Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
> 2 Nodes configured.
> 2 Resources configured.
> ============
>
> Node: lenny1 (8df8447f-6ecf-41a7-a131-c89fd59a120d): online
> Node: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325): online
>
> Master/Slave Set: ms-drbd0
>    drbd0:0     (ocf::heartbeat:drbd):  Master lenny1
>    drbd0:1     (ocf::heartbeat:drbd):  Slave lenny2
> Resource Group: sip-proxy-group
>    fs0 (ocf::heartbeat:Filesystem):    Started lenny1
>    mysql-server        (lsb:mysql):    Started lenny1
>    radius-server       (lsb:freeradius):       Started lenny1
>    virtual-ip  (ocf::heartbeat:IPaddr2):       Started lenny1
>    test        (lsb:test):     Stopped
>
> Failed actions:
>    test_monitor_30000 (node=lenny1, call=30, rc=7, status=complete): not running
>
> the question:  what am i missing here, i.e., what should i add to the crm
> config in order to get the group migrated to the other node if
> restarting of "test" fails 3 times?
>
> -- juha