[Pacemaker] Cannot start VirtualDomain resource after restart

Kadlecsik József kadlecsik.jozsef at wigner.mta.hu
Wed Jun 20 16:01:56 CEST 2012


On Wed, 20 Jun 2012, emmanuel segura wrote:

> Why do you say there is no error in the message?
> =========================================================
> Jun 20 11:57:25 atlas4 lrmd: [17568]: info: operation monitor[35] on lx0 for client 17571: pid 30179 exited with return code 7
> Jun 20 11:57:25 atlas4 crmd: [17571]: debug: create_operation_update: do_update_resource: Updating resouce lx0 after complete monitor op (interval=0)
> Jun 20 11:57:25 atlas4 crmd: [17571]: info: process_lrm_event: LRM operation lx0_monitor_0 (call=35, rc=7, cib-update=61, confirmed=true) not running

I interpreted those lines as the probe checking that the resource hasn't 
been started yet (confirmed=true). And indeed it is not running, so the 
return code is OCF_NOT_RUNNING.

There's no log message about an attempt to start the resource.
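
For reference, a minimal sketch (assuming the stock Pacemaker 1.1 command
line tools): "crm_simulate -L -s" prints the allocation scores the policy
engine computes from the live CIB, "crm_mon -1 -f" shows a one-shot status
including fail counts, and "crm resource cleanup lx0" clears lx0's operation
history on every node:

  # crm_simulate -L -s | grep lx0
  # crm_mon -1 -f
  # crm resource cleanup lx0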

Best regards,
Jozsef
 
> 2012/6/20 Kadlecsik József <kadlecsik.jozsef at wigner.mta.hu>
>       Hello,
> 
>       Somehow, after a "crm resource restart" which did *not* start the
>       resource but only stopped it, a VirtualDomain resource cannot be
>       started anymore. The most baffling part is that I do not see any
>       error message. The resource in question, named 'lx0', can be
>       started directly via virsh/libvirt, and libvirtd is running on all
>       cluster nodes.
> 
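>       For instance, starting it by hand looks roughly like this (an
>       illustration only; "virsh create" boots a transient domain from
>       the same XML the VirtualDomain agent uses, while "virsh start"
>       would be the equivalent for a persistently defined domain):
> 
>       # virsh -c qemu:///system create /etc/libvirt/crm/lx0.xml
> 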
>       We run corosync 1.4.2-1~bpo60+1 and pacemaker 1.1.6-2~bpo60+1
>       (Debian).
> 
>       # crm status
>       ============
>       Last updated: Wed Jun 20 15:14:44 2012
>       Last change: Wed Jun 20 14:07:40 2012 via cibadmin on atlas0
>       Stack: openais
>       Current DC: atlas0 - partition with quorum
>       Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>       7 Nodes configured, 7 expected votes
>       18 Resources configured.
>       ============
> 
>       Online: [ atlas0 atlas1 atlas2 atlas3 atlas4 atlas5 atlas6 ]
> 
>        kerberos       (ocf::heartbeat:VirtualDomain): Started atlas0
>        stonith-atlas3 (stonith:ipmilan):      Started atlas4
>        stonith-atlas1 (stonith:ipmilan):      Started atlas4
>        stonith-atlas2 (stonith:ipmilan):      Started atlas4
>        stonith-atlas0 (stonith:ipmilan):      Started atlas4
>        stonith-atlas4 (stonith:ipmilan):      Started atlas3
>        mailman        (ocf::heartbeat:VirtualDomain): Started atlas6
>        indico (ocf::heartbeat:VirtualDomain): Started atlas0
>        papi   (ocf::heartbeat:VirtualDomain): Started atlas1
>        wwwd   (ocf::heartbeat:VirtualDomain): Started atlas2
>        webauth        (ocf::heartbeat:VirtualDomain): Started atlas3
>        caladan        (ocf::heartbeat:VirtualDomain): Started atlas4
>        radius (ocf::heartbeat:VirtualDomain): Started atlas5
>        mail0  (ocf::heartbeat:VirtualDomain): Started atlas6
>        stonith-atlas5 (stonith:apcmastersnmp):        Started atlas4
>        stonith-atlas6 (stonith:apcmastersnmp):        Started atlas4
>        w0     (ocf::heartbeat:VirtualDomain): Started atlas2
> 
>       # crm resource show
>        kerberos       (ocf::heartbeat:VirtualDomain) Started
>        stonith-atlas3 (stonith:ipmilan) Started
>        stonith-atlas1 (stonith:ipmilan) Started
>        stonith-atlas2 (stonith:ipmilan) Started
>        stonith-atlas0 (stonith:ipmilan) Started
>        stonith-atlas4 (stonith:ipmilan) Started
>        mailman        (ocf::heartbeat:VirtualDomain) Started
>        indico (ocf::heartbeat:VirtualDomain) Started
>        papi   (ocf::heartbeat:VirtualDomain) Started
>        wwwd   (ocf::heartbeat:VirtualDomain) Started
>        webauth        (ocf::heartbeat:VirtualDomain) Started
>        caladan        (ocf::heartbeat:VirtualDomain) Started
>        radius (ocf::heartbeat:VirtualDomain) Started
>        mail0  (ocf::heartbeat:VirtualDomain) Started
>        stonith-atlas5 (stonith:apcmastersnmp) Started
>        stonith-atlas6 (stonith:apcmastersnmp) Started
>        w0     (ocf::heartbeat:VirtualDomain) Started
>        lx0    (ocf::heartbeat:VirtualDomain) Stopped
> 
>       # crm configure show
>       node atlas0 \
>              attributes standby="false" \
>              utilization memory="24576"
>       node atlas1 \
>              attributes standby="false" \
>              utilization memory="24576"
>       node atlas2 \
>              attributes standby="false" \
>              utilization memory="24576"
>       node atlas3 \
>              attributes standby="false" \
>              utilization memory="24576"
>       node atlas4 \
>              attributes standby="false" \
>              utilization memory="24576"
>       node atlas5 \
>              attributes standby="off" \
>              utilization memory="20480"
>       node atlas6 \
>              attributes standby="off" \
>              utilization memory="20480"
>       primitive caladan ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/caladan.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="4608"
>       primitive indico ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/indico.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="5120"
>       primitive kerberos ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/qemu/kerberos.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="4608"
>       primitive lx0 ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/lx0.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="4608"
>       primitive mail0 ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/mail0.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="4608"
>       primitive mailman ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/mailman.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="5120"
>       primitive papi ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/papi.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="6144"
>       primitive radius ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/radius.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="4608"
>       primitive stonith-atlas0 stonith:ipmilan \
>              params hostname="atlas0" ipaddr="192.168.40.20" port="623" auth="md5" priv="admin" login="root" password="XXXXX" \
>              op start interval="0" timeout="120s" \
>              meta target-role="Started"
>       primitive stonith-atlas1 stonith:ipmilan \
>              params hostname="atlas1" ipaddr="192.168.40.21" port="623" auth="md5" priv="admin" login="root" password="XXXX" \
>              op start interval="0" timeout="120s" \
>              meta target-role="Started"
>       primitive stonith-atlas2 stonith:ipmilan \
>              params hostname="atlas2" ipaddr="192.168.40.22" port="623" auth="md5" priv="admin" login="root" password="XXXX" \
>              op start interval="0" timeout="120s" \
>              meta target-role="Started"
>       primitive stonith-atlas3 stonith:ipmilan \
>              params hostname="atlas3" ipaddr="192.168.40.23" port="623" auth="md5" priv="admin" login="root" password="XXXX" \
>              op start interval="0" timeout="120s" \
>              meta target-role="Started"
>       primitive stonith-atlas4 stonith:ipmilan \
>              params hostname="atlas4" ipaddr="192.168.40.24" port="623" auth="md5" priv="admin" login="root" password="XXXX" \
>              op start interval="0" timeout="120s" \
>              meta target-role="Started"
>       primitive stonith-atlas5 stonith:apcmastersnmp \
>              params ipaddr="192.168.40.252" port="161" community="XXXX" pcmk_host_list="atlas5" pcmk_host_check="static-list"
>       primitive stonith-atlas6 stonith:apcmastersnmp \
>              params ipaddr="192.168.40.252" port="161" community="XXXX" pcmk_host_list="atlas6" pcmk_host_check="static-list"
>       primitive w0 ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/w0.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="4608"
>       primitive webauth ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/webauth.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="4608"
>       primitive wwwd ocf:heartbeat:VirtualDomain \
>              params config="/etc/libvirt/crm/wwwd.xml" hypervisor="qemu:///system" \
>              meta allow-migrate="true" target-role="Started" is-managed="true" \
>              op start interval="0" timeout="120s" \
>              op stop interval="0" timeout="120s" \
>              op monitor interval="10s" timeout="40s" depth="0" \
>              op migrate_to interval="0" timeout="240s" on-fail="block" \
>              op migrate_from interval="0" timeout="240s" on-fail="block" \
>              utilization memory="5120"
>       location location-stonith-atlas0 stonith-atlas0 -inf: atlas0
>       location location-stonith-atlas1 stonith-atlas1 -inf: atlas1
>       location location-stonith-atlas2 stonith-atlas2 -inf: atlas2
>       location location-stonith-atlas3 stonith-atlas3 -inf: atlas3
>       location location-stonith-atlas4 stonith-atlas4 -inf: atlas4
>       location location-stonith-atlas5 stonith-atlas5 -inf: atlas5
>       location location-stonith-atlas6 stonith-atlas6 -inf: atlas6
>       property $id="cib-bootstrap-options" \
>              dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>              cluster-infrastructure="openais" \
>              expected-quorum-votes="7" \
>              stonith-enabled="true" \
>              no-quorum-policy="stop" \
>              last-lrm-refresh="1340193431" \
>              symmetric-cluster="true" \
>              maintenance-mode="false" \
>              stop-all-resources="false" \
>              is-managed-default="true" \
>              placement-strategy="balanced"
> 
>       # crm_verify -L -VV
>       [...]
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   w0             (Started atlas2)
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   stonith-atlas6 (Started atlas4)
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   stonith-atlas5 (Started atlas4)
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   stonith-atlas4 (Started atlas3)
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   stonith-atlas3 (Started atlas4)
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   stonith-atlas2 (Started atlas4)
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   stonith-atlas1 (Started atlas4)
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   stonith-atlas0 (Started atlas4)
>       crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Start   lx0            (atlas4)
> 
>       I have tried to delete the resource and add it again; it did not
>       help (a sketch of that sequence follows the log excerpt below).
>       The corresponding log entries:
> 
>       Jun 20 11:57:25 atlas4 crmd: [17571]: info: delete_resource: Removing resource lx0 for 28654_crm_resource (internal) on atlas0
>       Jun 20 11:57:25 atlas4 lrmd: [17568]: debug: lrmd_rsc_destroy: removing resource lx0
>       Jun 20 11:57:25 atlas4 crmd: [17571]: debug: delete_rsc_entry: sync: Sending delete op for lx0
>       Jun 20 11:57:25 atlas4 crmd: [17571]: info: notify_deleted: Notifying 28654_crm_resource on atlas0 that lx0 was deleted
>       Jun 20 11:57:25 atlas4 crmd: [17571]: WARN: decode_transition_key: Bad UUID (crm-resource-28654) in sscanf result (3) for 0:0:crm-resource-28654
>       Jun 20 11:57:25 atlas4 crmd: [17571]: debug: create_operation_update: send_direct_ack: Updating resouce lx0 after complete delete op (interval=60000)
>       Jun 20 11:57:25 atlas4 crmd: [17571]: info: send_direct_ack: ACK'ing resource op lx0_delete_60000 from 0:0:crm-resource-28654: lrm_invoke-lrmd-1340186245-16
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] mcasted message added to pending queue
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] mcasted message added to pending queue
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering 10d5 to 10d7
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering MCAST message with seq 10d6 to pending delivery queue
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering MCAST message with seq 10d7 to pending delivery queue
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Received ringid(192.168.40.60:22264) seq 10d6
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Received ringid(192.168.40.60:22264) seq 10d7
>       Jun 20 11:57:25 atlas4 crmd: [17571]: debug: notify_deleted: Triggering a refresh after 28654_crm_resource deleted lx0 from the LRM
>       Jun 20 11:57:25 atlas4 cib: [17567]: debug: cib_process_xpath: Processing cib_query op for //cib/configuration/crm_config//cluster_property_set//nvpair[@name='last-lrm-refresh'] (/cib/configuration/crm_config/cluster_property_set/nvpair[6])
> 
> 
>       Jun 20 11:57:25 atlas4 lrmd: [17568]: debug: on_msg_add_rsc:client [17571] adds resource lx0
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering 149e to 149f
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering MCAST message with seq 149f to pending delivery queue
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Received ringid(192.168.40.60:22264) seq 14a0
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering 149f to 14a0
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering MCAST message with seq 14a0 to pending delivery queue
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] releasing messages up to and including 149e
>       Jun 20 11:57:25 atlas4 crmd: [17571]: info: do_lrm_rsc_op: Performing key=26:10266:7:e7426ec7-3bae-4a4b-a4ae-c3f80f17e058 op=lx0_monitor_0 )
>       Jun 20 11:57:25 atlas4 lrmd: [17568]: debug: on_msg_perform_op:2396: copying parameters for rsc lx0
>       Jun 20 11:57:25 atlas4 lrmd: [17568]: debug: on_msg_perform_op: add an operation operation monitor[35] on lx0 for client 17571, its parameters: crm_feature_set=[3.0.5] config=[/etc/libvirt/crm/lx0.xml] CRM_meta_timeout=[20000] hypervisor=[qemu:///system]  to the operation list.
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] releasing messages up to and including 149f
>       Jun 20 11:57:25 atlas4 lrmd: [17568]: info: rsc:lx0 probe[35] (pid 30179)
>       Jun 20 11:57:25 atlas4 VirtualDomain[30179]: INFO: Domain name "lx0" saved to /var/run/resource-agents/VirtualDomain-lx0.state.
>       Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] releasing messages up to and including 14bc
>       Jun 20 11:57:25 atlas4 VirtualDomain[30179]: DEBUG: Virtual domain lx0 is currently shut off.
>       Jun 20 11:57:25 atlas4 lrmd: [17568]: WARN: Managed lx0:monitor process 30179 exited with return code 7.
>       Jun 20 11:57:25 atlas4 lrmd: [17568]: info: operation monitor[35] on lx0 for client 17571: pid 30179 exited with return code 7
>       Jun 20 11:57:25 atlas4 crmd: [17571]: debug: create_operation_update: do_update_resource: Updating resouce lx0 after complete monitor op (interval=0)
>       Jun 20 11:57:25 atlas4 crmd: [17571]: info: process_lrm_event: LRM operation lx0_monitor_0 (call=35, rc=7, cib-update=61, confirmed=true) not running
>       Jun 20 11:57:25 atlas4 crmd: [17571]: debug: update_history_cache: Appending monitor op to history for 'lx0'
>       Jun 20 11:57:25 atlas4 crmd: [17571]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
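> 
>       For reference, a delete-and-re-add along the lines mentioned above
>       would look roughly like this (a sketch only; the primitive
>       definition is abbreviated here, not the full lx0 configuration):
> 
>       # crm resource stop lx0
>       # crm configure delete lx0
>       # crm configure primitive lx0 ocf:heartbeat:VirtualDomain \
>               params config="/etc/libvirt/crm/lx0.xml" hypervisor="qemu:///system" \
>               meta allow-migrate="true" target-role="Started"
>       # crm resource cleanup lx0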
> 
>       What can be wrong in the setup/configuration? And what on earth
>       happened?
> 
>       Best regards,
>       Jozsef
>       --
>       E-mail : kadlecsik.jozsef at wigner.mta.hu
>       PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
>       Address: Wigner Research Centre for Physics, Hungarian Academy
>       of Sciences
>               H-1525 Budapest 114, POB. 49, Hungary
> 
>       _______________________________________________
>       Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>       http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
>       Project Home: http://www.clusterlabs.org
>       Getting started:
>       http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>       Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> 
> --
> this is my life and I live it for as long as God wills
> 
> 

--
E-mail : kadlecsik.jozsef at wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
         H-1525 Budapest 114, POB. 49, Hungary

