[Pacemaker] resource agent starting out-of-order
Dejan Muhamedagic
dejanmm at fastmail.fm
Mon Mar 14 16:07:38 UTC 2011
Hi,
On Sun, Mar 13, 2011 at 11:15:25PM +0300, Pavel Levshin wrote:
> Hi.
>
> You have hit this:
>
> Mar 3 16:49:16 breadnut2 VirtualDomain[20709]: INFO: Virtual domain vg.test1 currently has no state, retrying.
> Mar 3 16:49:16 breadnut2 lrmd: [20694]: WARN: p-vd_vg.test1:monitor process (PID 20709) timed out (try 1). Killing with signal SIGTERM (15).
> Mar 3 16:49:16 breadnut2 lrmd: [20694]: WARN: operation monitor[5] on ocf::VirtualDomain::p-vd_vg.test1 for client 20697, its parameters: crm_feature_set=[3.0.5] config=[/etc/libvirt/qemu/vg.test1.xml] CRM_meta_timeout=[20000] migration_transport=[tcp] : pid [20709] timed out
> Mar 3 16:49:16 breadnut2 crmd: [20697]: ERROR: process_lrm_event: LRM operation p-vd_vg.test1_monitor_0 (5) Timed Out (timeout=20000ms)
>
>
> When a cluster node comes up, it is directed to probe every
> configured resource on that node, to find out whether the resource
> is already running there. These probes do not depend on ordering
> constraints; the check is mandatory.
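>
> (These probes show up as the resource's "monitor_0" operation - the
> p-vd_vg.test1_monitor_0 in the log above is exactly that - and
> "crm_mon -o" lists the recorded operations per node.)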
>
> At that moment libvirtd is not running yet, so the VirtualDomain RA
> is unable to connect to it and check whether your VM is running.
> The probe therefore times out.
>
> A timed-out monitor action is treated as an "unknown error" for the
> resource. The pengine cannot be sure that your resource is not
> running, so it assumes that it is, stops the resource everywhere,
> and then starts it again to recover.
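>
> For reference, the status check in the VirtualDomain RA boils down
> to asking libvirtd for the domain state, roughly like this (a
> simplified sketch, not the literal RA code):
>
>     # With libvirtd down, virsh cannot connect, so the probe hangs
>     # until lrmd kills it at the 20s operation timeout seen above.
>     virsh domstate vg.test1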
>
> This is what you are seeing. How to work around it is a different
> story; frankly, I don't see a decent way.
>
> The VirtualDomain RA really cannot tell whether the VM is running
> while it cannot connect to libvirtd. I'm not entirely sure, but
> your log suggests that libvirtd will not be started until the
> VirtualDomain monitor returns.
>
> I'd suggest starting libvirtd before corosync, from the init
> scripts, and seeing if that helps.
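>
> On Debian, something along these lines should do it (the sequence
> numbers are purely illustrative, and the libvirt init script may be
> called libvirt-bin rather than libvirtd on your system):
>
>     # start libvirtd earlier than corosync at boot
>     update-rc.d libvirt-bin defaults 19
>     update-rc.d corosync defaults 21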
Right.
> May anyone propose a cleaner solution?
No. The RA clearly states that libvirtd is required. The
corosync/heartbeat init scripts should have it as Should-Start.
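For example, an LSB header along these lines in the corosync init
script (only the Should-Start entry matters here; the other fields
are illustrative):

    ### BEGIN INIT INFO
    # Provides:          corosync
    # Required-Start:    $network $remote_fs $syslog
    # Required-Stop:     $network $remote_fs $syslog
    # Should-Start:      libvirtd
    # Should-Stop:       libvirtd
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: Corosync cluster engine
    ### END INIT INFO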
Thanks,
Dejan
>
> --
> Pavel Levshin
>
>
> On 03.03.2011 9:05, AP wrote:
>> Hi,
>>
>> I'm having deep issues with my cluster setup. Everything works OK
>> until I add a VirtualDomain RA; then things go pear-shaped: the
>> resource seems to ignore its "order" crm config and starts as soon
>> as it can.
>>
>> The crm config is provided below. Basically p-vd_vg.test1 attempts to
>> start despite p-libvirtd not being started and p-drbd_vg.test1 not
>> being master (or slave for that matter - ie it's not configured at all).
>>
>> Eventually p-libvirtd and p-drbd_vg.test1 start, and when
>> p-vd_vg.test1 attempts to start again, pengine on the node where
>> p-vd_vg.test1 is already running complains with:
>>
>> Mar 3 16:49:16 breadnut pengine: [2097]: ERROR: native_create_actions: Resource p-vd_vg.test1 (ocf::VirtualDomain) is active on 2 nodes attempting recovery
>> Mar 3 16:49:16 breadnut pengine: [2097]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>>
>> Then mass slaughter occurs and p-vd_vg.test1 is restarted where it was
>> running previously whilst the other node gets an error for it.
>>
>> Essentially I cannot restart the 2nd node without it breaking the 1st.
>>
>> Now, as I understand it, a lone primitive will run as a single
>> instance on some node - this is just fine by me.
>>
>> colo-vd_vg.test1 indicates that p-vd_vg.test1 should run where ms-drbd_vg.test1
>> is master. ms-drbd_vg.test1 should only be master where clone-libvirtd is
>> started.
>>
>> order-vg.test1 indicates that ms-drbd_vg.test1 should start after
>> clone-lvm_gh has started (successfully). (This used to use a
>> promote action for ms-drbd_vg.test1, but then ms-drbd_vg.test1
>> would only be demoted, not stopped, on shutdown, which caused
>> clone-lvm_gh to error out on stop.)
>>
>> order-vd_vg.test1 indicates p-vd_vg.test1 should only start where
>> ms-drbd_vg.test1 and clone-libvirtd have both successfully started (the
>> order of their starting being irrelevant).
>>
>> cli-standby-p-vd_vg.test1 was put there by my migrating
>> p-vd_vg.test1 around the cluster.
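>>
>> (If that constraint is stale, something like "crm resource
>> unmigrate p-vd_vg.test1" should remove it again.)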
>>
>> This happens with or without fencing, and with fencing configured
>> as below or as just a single primitive with both nodes in the
>> hostlist.
>>
>> Help with this would be awesome and appreciated. I do not know
>> what I am missing here. The config makes sense to me, so I don't
>> even know where to start poking and prodding. I'm flailing.
>>
>> Config and s/w version list is below:
>>
>> OS: Debian Squeeze
>> Kernel: 2.6.37.2
>>
>> PACKAGES:
>>
>> ii cluster-agents 1:1.0.4-0ubuntu1~custom1 The reusable cluster components for Linux HA
>> ii cluster-glue 1.0.7-3ubuntu1~custom1 The reusable cluster components for Linux HA
>> ii corosync 1.3.0-1ubuntu1~custom1 Standards-based cluster framework (daemon and modules)
>> ii libccs3 3.1.0-0ubuntu1~custom1 Red Hat cluster suite - cluster configuration libraries
>> ii libcib1 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - CIB
>> ii libcman3 3.1.0-0ubuntu1~custom1 Red Hat cluster suite - cluster manager libraries
>> ii libcorosync4 1.3.0-1ubuntu1~custom1 Standards-based cluster framework (libraries)
>> ii libcrmcluster1 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - CRM
>> ii libcrmcommon2 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - common CRM
>> ii libfence4 3.1.0-0ubuntu1~custom1 Red Hat cluster suite - fence client library
>> ii liblrm2 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- liblrm2
>> ii libpe-rules2 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - rules for P-Engine
>> ii libpe-status3 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - status for P-Engine
>> ii libpengine3 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - P-Engine
>> ii libpils2 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- libpils2
>> ii libplumb2 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- libplumb2
>> ii libplumbgpl2 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- libplumbgpl2
>> ii libstonith1 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- libstonith1
>> ii libstonithd1 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - stonith
>> ii libtransitioner1 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - transitioner
>> ii pacemaker 1.1.5-0ubuntu1~ppa1~custom1 HA cluster resource manager
>>
>> CONFIG:
>>
>> node breadnut
>> node breadnut2 \
>> attributes standby="off"
>> primitive fencing-bn stonith:meatware \
>> params hostlist="breadnut" \
>> op start interval="0" timeout="60s" \
>> op stop interval="0" timeout="70s" \
>> op monitor interval="10" timeout="60s"
>> primitive fencing-bn2 stonith:meatware \
>> params hostlist="breadnut2" \
>> op start interval="0" timeout="60s" \
>> op stop interval="0" timeout="70s" \
>> op monitor interval="10" timeout="60s"
>> primitive p-drbd_vg.test1 ocf:linbit:drbd \
>> params drbd_resource="vg.test1" \
>> operations $id="ops-drbd_vg.test1" \
>> op start interval="0" timeout="240s" \
>> op stop interval="0" timeout="100s" \
>> op monitor interval="20" role="Master" timeout="20s" \
>> op monitor interval="30" role="Slave" timeout="20s"
>> primitive p-libvirtd ocf:local:libvirtd \
>> meta allow-migrate="off" \
>> op start interval="0" timeout="200s" \
>> op stop interval="0" timeout="100s" \
>> op monitor interval="10" timeout="200s"
>> primitive p-lvm_gh ocf:heartbeat:LVM \
>> params volgrpname="gh" \
>> meta allow-migrate="off" \
>> op start interval="0" timeout="90s" \
>> op stop interval="0" timeout="100s" \
>> op monitor interval="10" timeout="100s"
>> primitive p-vd_vg.test1 ocf:heartbeat:VirtualDomain \
>> params config="/etc/libvirt/qemu/vg.test1.xml" \
>> params migration_transport="tcp" \
>> meta allow-migrate="true" is-managed="true" \
>> op start interval="0" timeout="120s" \
>> op stop interval="0" timeout="120s" \
>> op migrate_to interval="0" timeout="120s" \
>> op migrate_from interval="0" timeout="120s" \
>> op monitor interval="10s" timeout="120s"
>> ms ms-drbd_vg.test1 p-drbd_vg.test1 \
>> meta resource-stickiness="100" notify="true" master-max="2" target-role="Master"
>> clone clone-libvirtd p-libvirtd \
>> meta interleave="true"
>> clone clone-lvm_gh p-lvm_gh \
>> meta interleave="true"
>> location cli-standby-p-vd_vg.test1 p-vd_vg.test1 \
>> rule $id="cli-standby-rule-p-vd_vg.test1" -inf: #uname eq breadnut2
>> location loc-fencing-bn fencing-bn -inf: breadnut
>> location loc-fencing-bn2 fencing-bn2 -inf: breadnut2
>> colocation colo-vd_vg.test1 inf: p-vd_vg.test1:Started ms-drbd_vg.test1:Master clone-libvirtd:Started
>> order order-vd_vg.test1 inf: ( ms-drbd_vg.test1:start clone-libvirtd:start ) p-vd_vg.test1:start
>> order order-vg.test1 inf: clone-lvm_gh:start ms-drbd_vg.test1:start
>> property $id="cib-bootstrap-options" \
>> dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>> cluster-infrastructure="openais" \
>> default-resource-stickiness="1000" \
>> stonith-enabled="true" \
>> expected-quorum-votes="2" \
>> no-quorum-policy="ignore" \
>> last-lrm-refresh="1299128317"
>>
>>
>>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker