[Pacemaker] Cannot start VirtualDomain resource after restart

Thu Jun 21 04:30:05 CEST 2012

On Wed, Jun 20, 2012 at 11:51 PM, emmanuel segura <emi2fast at gmail.com> wrote:
> Hello
>
> Why do you say there is no error in the message?

Because it doesn't say "error" anywhere?
The logs below look completely normal for a node that's just joined the cluster.
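
For context, rc=7 is the standard OCF exit code OCF_NOT_RUNNING, and an
operation named <resource>_monitor_0 is a probe: a one-off monitor with
interval 0 that only records the resource's current state. Below is a
minimal Python sketch of how one might read such a crmd line. The regex,
the classify() helper and the rule for what counts as "expected" are
illustrative assumptions, not part of Pacemaker, and the wrapped log
entry has to be joined back onto a single line first.

```python
import re

# Standard OCF resource agent exit codes.
OCF_RETURN_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}

# Hypothetical pattern for crmd "process_lrm_event" lines such as:
#   LRM operation lx0_monitor_0 (call=35, rc=7, cib-update=61, confirmed=true) not running
LRM_EVENT = re.compile(
    r"LRM operation (?P<rsc>\S+)_"
    r"(?P<op>monitor|start|stop|migrate_to|migrate_from|promote|demote|notify)_"
    r"(?P<interval>\d+) \(call=\d+, rc=(?P<rc>\d+)"
)


def classify(line):
    """Return (resource, rc_name, is_probe, expected) or None if no match."""
    m = LRM_EVENT.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    # A monitor with interval 0 is a probe, not a recurring health check.
    is_probe = m.group("op") == "monitor" and int(m.group("interval")) == 0
    # Simplification (assumption): success is always fine, and a probe that
    # finds the resource stopped is a normal observation, not a failure.
    expected = rc == 0 or (is_probe and rc == 7)
    return m.group("rsc"), OCF_RETURN_CODES.get(rc, "UNKNOWN"), is_probe, expected


if __name__ == "__main__":
    entry = ("Jun 20 11:57:25 atlas4 crmd: [17571]: info: process_lrm_event: "
             "LRM operation lx0_monitor_0 (call=35, rc=7, cib-update=61, "
             "confirmed=true) not running")
    print(classify(entry))
```

Under these assumptions the quoted entry is classified as a probe that
found lx0 stopped, which is what a probe of a stopped resource should
report. The same rc=7 from a recurring monitor (non-zero interval) would
be a different matter.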

> =========================================================
>
> Jun 20 11:57:25 atlas4 lrmd: [17568]: info: operation monitor[35] on lx0
> for client 17571: pid 30179 exited with return code 7
> Jun 20 11:57:25 atlas4 crmd: [17571]: debug: create_operation_update:
> do_update_resource: Updating resouce lx0 after complete monitor op
> (interval=0)
> Jun 20 11:57:25 atlas4 crmd: [17571]: info: process_lrm_event: LRM
> operation lx0_monitor_0 (call=35, rc=7, cib-update=61, confirmed=true) not
> running
> =========================================================
>
>
> 2012/6/20 Kadlecsik József <kadlecsik.jozsef at wigner.mta.hu>
>>
>> Hello,
>>
>> After a "crm resource restart", which only stopped the resource and did
>> *not* start it again, a VirtualDomain resource cannot be started
>> anymore. The most baffling part is that I do not see any error message.
>> The resource in question, named 'lx0', can be started directly via
>> virsh/libvirt, and libvirtd is running on all cluster nodes.
>>
>> We run corosync 1.4.2-1~bpo60+1 and pacemaker 1.1.6-2~bpo60+1 (Debian).
>>
>> # crm status
>> ============
>> Last updated: Wed Jun 20 15:14:44 2012
>> Last change: Wed Jun 20 14:07:40 2012 via cibadmin on atlas0
>> Stack: openais
>> Current DC: atlas0 - partition with quorum
>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>> 7 Nodes configured, 7 expected votes
>> 18 Resources configured.
>> ============
>>
>> Online: [ atlas0 atlas1 atlas2 atlas3 atlas4 atlas5 atlas6 ]
>>
>>  kerberos       (ocf::heartbeat:VirtualDomain): Started atlas0
>>  stonith-atlas3 (stonith:ipmilan):      Started atlas4
>>  stonith-atlas1 (stonith:ipmilan):      Started atlas4
>>  stonith-atlas2 (stonith:ipmilan):      Started atlas4
>>  stonith-atlas0 (stonith:ipmilan):      Started atlas4
>>  stonith-atlas4 (stonith:ipmilan):      Started atlas3
>>  mailman        (ocf::heartbeat:VirtualDomain): Started atlas6
>>  indico (ocf::heartbeat:VirtualDomain): Started atlas0
>>  papi   (ocf::heartbeat:VirtualDomain): Started atlas1
>>  wwwd   (ocf::heartbeat:VirtualDomain): Started atlas2
>>  webauth        (ocf::heartbeat:VirtualDomain): Started atlas3
>>  caladan        (ocf::heartbeat:VirtualDomain): Started atlas4
>>  radius (ocf::heartbeat:VirtualDomain): Started atlas5
>>  mail0  (ocf::heartbeat:VirtualDomain): Started atlas6
>>  stonith-atlas5 (stonith:apcmastersnmp):        Started atlas4
>>  stonith-atlas6 (stonith:apcmastersnmp):        Started atlas4
>>  w0     (ocf::heartbeat:VirtualDomain): Started atlas2
>>
>> # crm resource show
>>  kerberos       (ocf::heartbeat:VirtualDomain) Started
>>  stonith-atlas3 (stonith:ipmilan) Started
>>  stonith-atlas1 (stonith:ipmilan) Started
>>  stonith-atlas2 (stonith:ipmilan) Started
>>  stonith-atlas0 (stonith:ipmilan) Started
>>  stonith-atlas4 (stonith:ipmilan) Started
>>  mailman        (ocf::heartbeat:VirtualDomain) Started
>>  indico (ocf::heartbeat:VirtualDomain) Started
>>  papi   (ocf::heartbeat:VirtualDomain) Started
>>  wwwd   (ocf::heartbeat:VirtualDomain) Started
>>  webauth        (ocf::heartbeat:VirtualDomain) Started
>>  caladan        (ocf::heartbeat:VirtualDomain) Started
>>  radius (ocf::heartbeat:VirtualDomain) Started
>>  mail0  (ocf::heartbeat:VirtualDomain) Started
>>  stonith-atlas5 (stonith:apcmastersnmp) Started
>>  stonith-atlas6 (stonith:apcmastersnmp) Started
>>  w0     (ocf::heartbeat:VirtualDomain) Started
>>  lx0    (ocf::heartbeat:VirtualDomain) Stopped
>>
>> # crm configure show
>> node atlas0 \
>>        attributes standby="false" \
>>        utilization memory="24576"
>> node atlas1 \
>>        attributes standby="false" \
>>        utilization memory="24576"
>> node atlas2 \
>>        attributes standby="false" \
>>        utilization memory="24576"
>> node atlas3 \
>>        attributes standby="false" \
>>        utilization memory="24576"
>> node atlas4 \
>>        attributes standby="false" \
>>        utilization memory="24576"
>> node atlas5 \
>>        attributes standby="off" \
>>        utilization memory="20480"
>> node atlas6 \
>>        attributes standby="off" \
>>        utilization memory="20480"
>> primitive caladan ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/caladan.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="4608"
>> primitive indico ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/indico.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="5120"
>> primitive kerberos ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/qemu/kerberos.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="4608"
>> primitive lx0 ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/lx0.xml" hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="4608"
>> primitive mail0 ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/mail0.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="4608"
>> primitive mailman ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/mailman.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="5120"
>> primitive papi ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/papi.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="6144"
>> primitive radius ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/radius.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="4608"
>> primitive stonith-atlas0 stonith:ipmilan \
>>        params hostname="atlas0" ipaddr="192.168.40.20" port="623"
>> auth="md5" priv="admin" login="root" password="XXXXX" \
>>        op start interval="0" timeout="120s" \
>>        meta target-role="Started"
>> primitive stonith-atlas1 stonith:ipmilan \
>>        params hostname="atlas1" ipaddr="192.168.40.21" port="623"
>> auth="md5" priv="admin" login="root" password="XXXX" \
>>        op start interval="0" timeout="120s" \
>>        meta target-role="Started"
>> primitive stonith-atlas2 stonith:ipmilan \
>>        params hostname="atlas2" ipaddr="192.168.40.22" port="623"
>> auth="md5" priv="admin" login="root" password="XXXX" \
>>        op start interval="0" timeout="120s" \
>>        meta target-role="Started"
>> primitive stonith-atlas3 stonith:ipmilan \
>>        params hostname="atlas3" ipaddr="192.168.40.23" port="623"
>> auth="md5" priv="admin" login="root" password="XXXX" \
>>        op start interval="0" timeout="120s" \
>>        meta target-role="Started"
>> primitive stonith-atlas4 stonith:ipmilan \
>>        params hostname="atlas4" ipaddr="192.168.40.24" port="623"
>> auth="md5" priv="admin" login="root" password="XXXX" \
>>        op start interval="0" timeout="120s" \
>>        meta target-role="Started"
>> primitive stonith-atlas5 stonith:apcmastersnmp \
>>        params ipaddr="192.168.40.252" port="161" community="XXXX"
>> pcmk_host_list="atlas5" pcmk_host_check="static-list"
>> primitive stonith-atlas6 stonith:apcmastersnmp \
>>        params ipaddr="192.168.40.252" port="161" community="XXXX"
>> pcmk_host_list="atlas6" pcmk_host_check="static-list"
>> primitive w0 ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/w0.xml" hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="4608"
>> primitive webauth ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/webauth.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="4608"
>> primitive wwwd ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/crm/wwwd.xml"
>> hypervisor="qemu:///system" \
>>        meta allow-migrate="true" target-role="Started" is-managed="true" \
>>        op start interval="0" timeout="120s" \
>>        op stop interval="0" timeout="120s" \
>>        op monitor interval="10s" timeout="40s" depth="0" \
>>        op migrate_to interval="0" timeout="240s" on-fail="block" \
>>        op migrate_from interval="0" timeout="240s" on-fail="block" \
>>        utilization memory="5120"
>> location location-stonith-atlas0 stonith-atlas0 -inf: atlas0
>> location location-stonith-atlas1 stonith-atlas1 -inf: atlas1
>> location location-stonith-atlas2 stonith-atlas2 -inf: atlas2
>> location location-stonith-atlas3 stonith-atlas3 -inf: atlas3
>> location location-stonith-atlas4 stonith-atlas4 -inf: atlas4
>> location location-stonith-atlas5 stonith-atlas5 -inf: atlas5
>> location location-stonith-atlas6 stonith-atlas6 -inf: atlas6
>> property $id="cib-bootstrap-options" \
>>        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>        cluster-infrastructure="openais" \
>>        expected-quorum-votes="7" \
>>        stonith-enabled="true" \
>>        no-quorum-policy="stop" \
>>        last-lrm-refresh="1340193431" \
>>        symmetric-cluster="true" \
>>        maintenance-mode="false" \
>>        stop-all-resources="false" \
>>        is-managed-default="true" \
>>        placement-strategy="balanced"
>>
>> # crm_verify -L -VV
>> [...]
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave   w0
>> (Started atlas2)
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave
>> stonith-atlas6       (Started atlas4)
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave
>> stonith-atlas5       (Started atlas4)
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave
>> stonith-atlas4       (Started atlas3)
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave
>> stonith-atlas3       (Started atlas4)
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave
>> stonith-atlas2       (Started atlas4)
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave
>> stonith-atlas1       (Started atlas4)
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Leave
>> stonith-atlas0       (Started atlas4)
>> crm_verify[19320]: 2012/06/20_15:25:50 notice: LogActions: Start   lx0
>> (atlas4)
>>
>> I have tried deleting the resource and adding it again, but that did not help.
>> The corresponding log entries:
>>
>> Jun 20 11:57:25 atlas4 crmd: [17571]: info: delete_resource: Removing
>> resource lx0 for 28654_crm_resource (internal) on atlas0
>> Jun 20 11:57:25 atlas4 lrmd: [17568]: debug: lrmd_rsc_destroy: removing
>> resource lx0
>> Jun 20 11:57:25 atlas4 crmd: [17571]: debug: delete_rsc_entry: sync:
>> Sending delete op for lx0
>> Jun 20 11:57:25 atlas4 crmd: [17571]: info: notify_deleted: Notifying
>> 28654_crm_resource on atlas0 that lx0 was deleted
>> Jun 20 11:57:25 atlas4 crmd: [17571]: WARN: decode_transition_key: Bad
>> UUID (crm-resource-28654) in sscanf result (3) for 0:0:crm-resource-28654
>> Jun 20 11:57:25 atlas4 crmd: [17571]: debug: create_operation_update:
>> send_direct_ack: Updating resouce lx0 after complete delete op
>> (interval=60000)
>> Jun 20 11:57:25 atlas4 crmd: [17571]: info: send_direct_ack: ACK'ing
>> resource op lx0_delete_60000 from 0:0:crm-resource-28654:
>> lrm_invoke-lrmd-1340186245-16
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] mcasted message added
>> to pending queue
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] mcasted message added
>> to pending queue
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering 10d5 to 10d7
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering MCAST
>> message with seq 10d6 to pending delivery queue
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering MCAST
>> message with seq 10d7 to pending delivery queue
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Received
>> ringid(192.168.40.60:22264) seq 10d6
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Received
>> ringid(192.168.40.60:22264) seq 10d7
>> Jun 20 11:57:25 atlas4 crmd: [17571]: debug: notify_deleted: Triggering a
>> refresh after 28654_crm_resource deleted lx0 from the LRM
>> Jun 20 11:57:25 atlas4 cib: [17567]: debug: cib_process_xpath: Processing
>> cib_query op for
>>
>> //cib/configuration/crm_config//cluster_property_set//nvpair[@name='last-lrm-refresh']
>> (/cib/configuration/crm_config/cluster_property_set/nvpair[6])
>>
>>
>> Jun 20 11:57:25 atlas4 lrmd: [17568]: debug: on_msg_add_rsc:client [17571]
>> adds resource lx0
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering 149e to 149f
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering MCAST
>> message with seq 149f to pending delivery queue
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Received
>> ringid(192.168.40.60:22264) seq 14a0
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering 149f to 14a0
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] Delivering MCAST
>> message with seq 14a0 to pending delivery queue
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] releasing messages up
>> to and including 149e
>> Jun 20 11:57:25 atlas4 crmd: [17571]: info: do_lrm_rsc_op: Performing
>> key=26:10266:7:e7426ec7-3bae-4a4b-a4ae-c3f80f17e058 op=lx0_monitor_0 )
>> Jun 20 11:57:25 atlas4 lrmd: [17568]: debug: on_msg_perform_op:2396:
>> copying parameters for rsc lx0
>> Jun 20 11:57:25 atlas4 lrmd: [17568]: debug: on_msg_perform_op: add an
>> operation operation monitor[35] on lx0 for client 17571, its parameters:
>> crm_feature_set=[3.0.5] config=[/etc/libvirt/crm/lx0.xml]
>> CRM_meta_timeout=[20000] hypervisor=[qemu:///system]  to the operation
>> list.
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] releasing messages up
>> to and including 149f
>> Jun 20 11:57:25 atlas4 lrmd: [17568]: info: rsc:lx0 probe[35] (pid 30179)
>> Jun 20 11:57:25 atlas4 VirtualDomain[30179]: INFO: Domain name "lx0" saved
>> to /var/run/resource-agents/VirtualDomain-lx0.state.
>> Jun 20 11:57:25 atlas4 corosync[17530]:   [TOTEM ] releasing messages up
>> to and including 14bc
>> Jun 20 11:57:25 atlas4 VirtualDomain[30179]: DEBUG: Virtual domain lx0 is
>> currently shut off.
>> Jun 20 11:57:25 atlas4 lrmd: [17568]: WARN: Managed lx0:monitor process
>> 30179 exited with return code 7.
>> Jun 20 11:57:25 atlas4 lrmd: [17568]: info: operation monitor[35] on lx0
>> for client 17571: pid 30179 exited with return code 7
>> Jun 20 11:57:25 atlas4 crmd: [17571]: debug: create_operation_update:
>> do_update_resource: Updating resouce lx0 after complete monitor op
>> (interval=0)
>> Jun 20 11:57:25 atlas4 crmd: [17571]: info: process_lrm_event: LRM
>> operation lx0_monitor_0 (call=35, rc=7, cib-update=61, confirmed=true) not
>> running
>> Jun 20 11:57:25 atlas4 crmd: [17571]: debug: update_history_cache:
>> Appending monitor op to history for 'lx0'
>> Jun 20 11:57:25 atlas4 crmd: [17571]: debug: get_xpath_object: No match
>> for //cib_update_result//diff-added//crm_config in
>> /notify/cib_update_result/diff
>>
>> What could be wrong in the setup/configuration? And what on earth
>> happened?
>>
>> Best regards,
>> Jozsef
>> --
>> E-mail : kadlecsik.jozsef at wigner.mta.hu
>> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
>> Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
>>         H-1525 Budapest 114, POB. 49, Hungary
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
>
> --
> this is my life and I live it for as long as God wills