[Pacemaker] Resource does not auto recover from failed state
Andrew Beekhof
andrew at beekhof.net
Mon Sep 2 00:34:52 UTC 2013
On 27/08/2013, at 6:32 PM, tetsuo shima <tetsuo.41.shima at gmail.com> wrote:
> Hi list!
>
> I'm having an issue with Corosync; here is the scenario:
>
> # crm_mon -1
> ============
> Last updated: Tue Aug 27 09:50:13 2013
> Last change: Mon Aug 26 16:06:01 2013 via cibadmin on node2
> Stack: openais
> Current DC: node1 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 3 Resources configured.
> ============
>
> Online: [ node2 node1 ]
>
> ip (ocf::heartbeat:IPaddr2): Started node1
> Clone Set: mysql-mm [mysql] (unmanaged)
> mysql:0 (ocf::heartbeat:mysql): Started node1 (unmanaged)
> mysql:1 (ocf::heartbeat:mysql): Started node2 (unmanaged)
>
> # /etc/init.d/mysql stop
> [ ok ] Stopping MySQL database server: mysqld.
>
> # crm_mon -1
> ============
> Last updated: Tue Aug 27 09:50:30 2013
> Last change: Mon Aug 26 16:06:01 2013 via cibadmin on node2
> Stack: openais
> Current DC: node1 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 3 Resources configured.
> ============
>
> Online: [ node2 node1 ]
>
> ip (ocf::heartbeat:IPaddr2): Started node1
> Clone Set: mysql-mm [mysql] (unmanaged)
> mysql:0 (ocf::heartbeat:mysql): Started node1 (unmanaged)
> mysql:1 (ocf::heartbeat:mysql): Started node2 (unmanaged) FAILED
>
> Failed actions:
> mysql:0_monitor_15000 (node=node2, call=27, rc=7, status=complete): not running
>
> # /etc/init.d/mysql start
> [ ok ] Starting MySQL database server: mysqld ..
> [info] Checking for tables which need an upgrade, are corrupt or were
> not closed cleanly..
>
> # sleep 60 && crm_mon -1
> ============
> Last updated: Tue Aug 27 09:51:54 2013
> Last change: Mon Aug 26 16:06:01 2013 via cibadmin on node2
> Stack: openais
> Current DC: node1 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 3 Resources configured.
> ============
>
> Online: [ node2 node1 ]
>
> ip (ocf::heartbeat:IPaddr2): Started node1
> Clone Set: mysql-mm [mysql] (unmanaged)
> mysql:0 (ocf::heartbeat:mysql): Started node1 (unmanaged)
> mysql:1 (ocf::heartbeat:mysql): Started node2 (unmanaged) FAILED
>
> Failed actions:
> mysql:0_monitor_15000 (node=node2, call=27, rc=7, status=complete): not running
>
> As you can see, every time I stop MySQL (which is unmanaged), the resource is marked as failed:
>
> crmd: [1828]: info: process_lrm_event: LRM operation mysql:0_monitor_15000 (call=4, rc=7, cib-update=10, confirmed=false) not running
>
> When I restart the resource:
>
> crmd: [1828]: info: process_lrm_event: LRM operation mysql:0_monitor_15000 (call=4, rc=0, cib-update=11, confirmed=false) ok
>
> The resource remains in the failed state and does not recover until I manually clean it up.
Older versions did that. Try 1.1.10.
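Until you can upgrade, the usual workaround is to clear the failure record by hand once the service is healthy again. A sketch, using the resource and node names from your output (the exact option spelling varies between crm_resource versions; older builds take -H for the host name):

```
# With the crm shell: clears the fail-count and the failed-op
# history for the mysql clone on node2
crm resource cleanup mysql node2

# Roughly equivalent low-level call
crm_resource --cleanup --resource mysql -H node2
```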
>
> # crm_mon --one-shot --operations
> ============
> Last updated: Tue Aug 27 10:17:30 2013
> Last change: Mon Aug 26 16:06:01 2013 via cibadmin on node2
> Stack: openais
> Current DC: node1 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 3 Resources configured.
> ============
>
> Online: [ node2 node1 ]
>
> ip (ocf::heartbeat:IPaddr2): Started node1
> Clone Set: mysql-mm [mysql] (unmanaged)
> mysql:0 (ocf::heartbeat:mysql): Started node1 (unmanaged)
> mysql:1 (ocf::heartbeat:mysql): Started node2 (unmanaged) FAILED
>
> Operations:
> * Node node1:
> ip: migration-threshold=1
> + (57) probe: rc=0 (ok)
> mysql:0: migration-threshold=1 fail-count=1
> + (58) probe: rc=0 (ok)
> + (59) monitor: interval=15000ms rc=0 (ok)
> * Node node2:
> mysql:0: migration-threshold=1 fail-count=3
> + (27) monitor: interval=15000ms rc=7 (not running)
> + (27) monitor: interval=15000ms rc=0 (ok)
>
> Failed actions:
> mysql:0_monitor_15000 (node=node2, call=27, rc=7, status=complete): not running
>
> ---
>
> Here are some details about my configuration:
>
> # cat /etc/debian_version
> 7.1
>
> # dpkg -l | grep corosync
> ii corosync 1.4.2-3 amd64 Standards-based cluster framework
>
> # dpkg -l | grep pacem
> ii pacemaker 1.1.7-1 amd64 HA cluster resource manager
>
> # crm configure show
> node node2 \
> attributes standby="off"
> node node1
> primitive ip ocf:heartbeat:IPaddr2 \
> params ip="192.168.0.20" cidr_netmask="255.255.0.0" nic="eth2.2755" iflabel="mysql" \
> meta is-managed="true" target-role="Started" \
> meta resource-stickiness="100"
> primitive mysql ocf:heartbeat:mysql \
> op monitor interval="15" timeout="30"
> clone mysql-mm mysql \
> meta is-managed="false"
> location cli-prefer-ip ip 50: node1
> colocation ip-on-mysql-mm 200: ip mysql-mm
> property $id="cib-bootstrap-options" \
> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1377513557" \
> start-failure-is-fatal="false"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="1" \
> migration-threshold="1"
>
> ---
>
> Does anyone know what is wrong with my configuration?
>
> Thanks for the help,
>
> Best regards.
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
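If you would rather have failures expire on their own instead of sticking until a manual cleanup, you can also set a failure-timeout on the resource. A hedged sketch against your crm configuration; the 60s value is only an example, and note that expiry is only evaluated periodically (governed by the cluster-recheck-interval property, 15 minutes by default):

```
primitive mysql ocf:heartbeat:mysql \
    op monitor interval="15" timeout="30" \
    meta failure-timeout="60s"
```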