[Pacemaker] resource agent starting out-of-order

Sun Mar 13 16:15:25 EDT 2011

Hi.

You have hit this:

Mar  3 16:49:16 breadnut2 VirtualDomain[20709]: INFO: Virtual domain vg.test1 currently has no state, retrying.
Mar  3 16:49:16 breadnut2 lrmd: [20694]: WARN: p-vd_vg.test1:monitor process (PID 20709) timed out (try 1).  Killing with signal SIGTERM (15).
Mar  3 16:49:16 breadnut2 lrmd: [20694]: WARN: operation monitor[5] on ocf::VirtualDomain::p-vd_vg.test1 for client 20697, its parameters: crm_feature_set=[3.0.5] config=[/etc/libvirt/qemu/vg.test1.xml] CRM_meta_timeout=[20000] migration_transport=[tcp] : pid [20709] timed out
Mar  3 16:49:16 breadnut2 crmd: [20697]: ERROR: process_lrm_event: LRM operation p-vd_vg.test1_monitor_0 (5) Timed Out (timeout=20000ms)

When a cluster node comes up, it is directed to probe every clustered 
resource on that node. This probe is mandatory and does not depend on 
constraints.

At that moment, libvirtd is not running yet. So the VirtualDomain RA 
cannot connect to it to check whether your VM is running, and the 
monitor operation times out (after 20 seconds, per your log).

A timed-out monitor action is treated as an "unknown error" for the 
resource. Pengine cannot be sure that your resource is not running, so it 
assumes it is, stops the resource everywhere, and then starts it again 
to recover.
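
To make the distinction concrete, here is a rough sketch of how a probe 
result is interpreted. The numeric codes are the standard OCF return 
codes; the function name classify_probe_rc is hypothetical and only for 
illustration, and the real decision logic in Pacemaker is more involved.

```shell
#!/bin/sh
# Standard OCF return codes that matter for a probe.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# classify_probe_rc is a hypothetical helper, not part of Pacemaker.
# It prints how a probe that ended with result "$1" is treated.
# Anything that is not a clean "running" or "not running" answer,
# including a timeout, counts as a failure that needs recovery.
classify_probe_rc() {
    case "$1" in
        "$OCF_SUCCESS"|"$OCF_RUNNING_MASTER") echo "active" ;;
        "$OCF_NOT_RUNNING")                   echo "inactive" ;;
        *)                                    echo "failed" ;;
    esac
}
```

Only a clean "not running" (7) tells the cluster that it is safe to 
start the resource elsewhere. Your probe never got that far, because it 
was killed on timeout.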

This is what you get. How to work around it is a different story. 
Frankly, I don't see a decent way.

The VirtualDomain RA really cannot tell whether the VM is running while 
it cannot connect to libvirtd. I'm not too sure, but your log suggests 
that libvirtd will not be started until the VirtualDomain monitor returns.

I'd suggest starting libvirtd before corosync, from the initscripts, to 
see if that helps.
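
If plain boot ordering is not enough, one possible approach is to hold 
corosync back until libvirtd actually answers. This is an untested 
sketch: wait_for is a hypothetical helper, and the virsh check and the 
30-second limit in the usage comment are assumptions you would need to 
adapt to your setup.

```shell
#!/bin/sh
# wait_for is a hypothetical helper: run the command given after the
# first argument once per second until it succeeds, or until the number
# of attempts in $1 is used up. Returns 0 on success, 1 otherwise.
wait_for() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        # Pause between attempts, but not after the last one.
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Assumed usage in a local init script that runs before corosync:
#   wait_for 30 virsh -c qemu:///system list || exit 1
```

Note that your libvirtd is currently a cluster resource (p-libvirtd), so 
starting it outside the cluster changes who manages it.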

Can anyone propose a cleaner solution?

--
Pavel Levshin

03.03.2011 9:05, AP wrote:
> Hi,
>
> Having deep issues with my cluster setup. Everything works ok until
> I add a VirtualDomain RA in. Then things go pearshaped in that it seems
> to ignore the "order" crm config for it and starts as soon as it can.
>
> The crm config is provided below. Basically p-vd_vg.test1 attempts to
> start despite p-libvirtd not being started and p-drbd_vg.test1 not
> being master (or slave for that matter - ie it's not configured at all).
>
> Eventually p-libvirtd and p-drbd_vg.test1 start and p-vd_vg.test1 attempts
> to, pengine on the node where p-vd_vg.test1 is already running complains
> with:
>
> Mar  3 16:49:16 breadnut pengine: [2097]: ERROR: native_create_actions: Resource p-vd_vg.test1 (ocf::VirtualDomain) is active on 2 nodes attempting recovery
> Mar  3 16:49:16 breadnut pengine: [2097]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> Then mass slaughter occurs and p-vd_vg.test1 is restarted where it was
> running previously whilst the other node gets an error for it.
>
> Essentially I cannot restart the 2nd node without it breaking the 1st.
>
> Now, as I understand it, a lone primitive will run once on any node - this
> is just fine by me.
>
> colo-vd_vg.test1 indicates that p-vd_vg.test1 should run where ms-drbd_vg.test1
> is master. ms-drbd_vg.test1 should only be master where clone-libvirtd is
> started.
>
> order-vg.test1 indicates that ms-drbd_vg.test1 should start after clone-lvm_gh
> is started (successfully). (This used to have a promote for ms-drbd_vg.test1
> but then ms-drbd_vg.test1 would be demoted and not stopped on shutdown which
> would cause clone-lvm_gh to error out on stop)
>
> order-vd_vg.test1 indicates p-vd_vg.test1 should only start where
> ms-drbd_vg.test1 and clone-libvirtd have both successfully started (the
> order of their starting being irrelevant).
>
> cli-standby-p-vd_vg.test1 was put there by my migrating p-vd_vg.test1
> about the place.
>
> This happens with or without fencing and with fencing configured as below
> or as just a single primitive with both nodes in the hostlist.
>
> Help with this would be awesome and appreciated. I do not know what I am
> missing here. The config makes sense to me so I don't even know where
> to start poking and prodding. I be flailing.
>
> Config and s/w version list is below:
>
> OS: Debian Squeeze
> Kernel: 2.6.37.2
>
> PACKAGES:
>
> ii  cluster-agents                      1:1.0.4-0ubuntu1~custom1     The reusable cluster components for Linux HA
> ii  cluster-glue                        1.0.7-3ubuntu1~custom1       The reusable cluster components for Linux HA
> ii  corosync                            1.3.0-1ubuntu1~custom1       Standards-based cluster framework (daemon and modules)
> ii  libccs3                             3.1.0-0ubuntu1~custom1       Red Hat cluster suite - cluster configuration libraries
> ii  libcib1                             1.1.5-0ubuntu1~ppa1~custom1  The Pacemaker libraries - CIB
> ii  libcman3                            3.1.0-0ubuntu1~custom1       Red Hat cluster suite - cluster manager libraries
> ii  libcorosync4                        1.3.0-1ubuntu1~custom1       Standards-based cluster framework (libraries)
> ii  libcrmcluster1                      1.1.5-0ubuntu1~ppa1~custom1  The Pacemaker libraries - CRM
> ii  libcrmcommon2                       1.1.5-0ubuntu1~ppa1~custom1  The Pacemaker libraries - common CRM
> ii  libfence4                           3.1.0-0ubuntu1~custom1       Red Hat cluster suite - fence client library
> ii  liblrm2                             1.0.7-3ubuntu1~custom1       Reusable cluster libraries -- liblrm2
> ii  libpe-rules2                        1.1.5-0ubuntu1~ppa1~custom1  The Pacemaker libraries - rules for P-Engine
> ii  libpe-status3                       1.1.5-0ubuntu1~ppa1~custom1  The Pacemaker libraries - status for P-Engine
> ii  libpengine3                         1.1.5-0ubuntu1~ppa1~custom1  The Pacemaker libraries - P-Engine
> ii  libpils2                            1.0.7-3ubuntu1~custom1       Reusable cluster libraries -- libpils2
> ii  libplumb2                           1.0.7-3ubuntu1~custom1       Reusable cluster libraries -- libplumb2
> ii  libplumbgpl2                        1.0.7-3ubuntu1~custom1       Reusable cluster libraries -- libplumbgpl2
> ii  libstonith1                         1.0.7-3ubuntu1~custom1       Reusable cluster libraries -- libstonith1
> ii  libstonithd1                        1.1.5-0ubuntu1~ppa1~custom1  The Pacemaker libraries - stonith
> ii  libtransitioner1                    1.1.5-0ubuntu1~ppa1~custom1  The Pacemaker libraries - transitioner
> ii  pacemaker                           1.1.5-0ubuntu1~ppa1~custom1  HA cluster resource manager
>
> CONFIG:
>
> node breadnut
> node breadnut2 \
>          attributes standby="off"
> primitive fencing-bn stonith:meatware \
>          params hostlist="breadnut" \
>          op start interval="0" timeout="60s" \
>          op stop interval="0" timeout="70s" \
>          op monitor interval="10" timeout="60s"
> primitive fencing-bn2 stonith:meatware \
>          params hostlist="breadnut2" \
>          op start interval="0" timeout="60s" \
>          op stop interval="0" timeout="70s" \
>          op monitor interval="10" timeout="60s"
> primitive p-drbd_vg.test1 ocf:linbit:drbd \
>          params drbd_resource="vg.test1" \
>          operations $id="ops-drbd_vg.test1" \
>          op start interval="0" timeout="240s" \
>          op stop interval="0" timeout="100s" \
>          op monitor interval="20" role="Master" timeout="20s" \
>          op monitor interval="30" role="Slave" timeout="20s"
> primitive p-libvirtd ocf:local:libvirtd \
>          meta allow-migrate="off" \
>          op start interval="0" timeout="200s" \
>          op stop interval="0" timeout="100s" \
>          op monitor interval="10" timeout="200s"
> primitive p-lvm_gh ocf:heartbeat:LVM \
>          params volgrpname="gh" \
>          meta allow-migrate="off" \
>          op start interval="0" timeout="90s" \
>          op stop interval="0" timeout="100s" \
>          op monitor interval="10" timeout="100s"
> primitive p-vd_vg.test1 ocf:heartbeat:VirtualDomain \
>          params config="/etc/libvirt/qemu/vg.test1.xml" \
>          params migration_transport="tcp" \
>          meta allow-migrate="true" is-managed="true" \
>          op start interval="0" timeout="120s" \
>          op stop interval="0" timeout="120s" \
>          op migrate_to interval="0" timeout="120s" \
>          op migrate_from interval="0" timeout="120s" \
>          op monitor interval="10s" timeout="120s"
> ms ms-drbd_vg.test1 p-drbd_vg.test1 \
>          meta resource-stickines="100" notify="true" master-max="2" target-role="Master"
> clone clone-libvirtd p-libvirtd \
>          meta interleave="true"
> clone clone-lvm_gh p-lvm_gh \
>          meta interleave="true"
> location cli-standby-p-vd_vg.test1 p-vd_vg.test1 \
>          rule $id="cli-standby-rule-p-vd_vg.test1" -inf: #uname eq breadnut2
> location loc-fencing-bn fencing-bn -inf: breadnut
> location loc-fencing-bn2 fencing-bn2 -inf: breadnut2
> colocation colo-vd_vg.test1 inf: p-vd_vg.test1:Started ms-drbd_vg.test1:Master clone-libvirtd:Started
> order order-vd_vg.test1 inf: ( ms-drbd_vg.test1:start clone-libvirtd:start ) p-vd_vg.test1:start
> order order-vg.test1 inf: clone-lvm_gh:start ms-drbd_vg.test1:start
> property $id="cib-bootstrap-options" \
>          dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>          cluster-infrastructure="openais" \
>          default-resource-stickiness="1000" \
>          stonith-enabled="true" \
>          expected-quorum-votes="2" \
>          no-quorum-policy="ignore" \
>          last-lrm-refresh="1299128317"