[Pacemaker] resource agent starting out-of-order
Dejan Muhamedagic
dejanmm at fastmail.fm
Mon Mar 14 16:07:38 UTC 2011
Hi,
On Sun, Mar 13, 2011 at 11:15:25PM +0300, Pavel Levshin wrote:
> Hi.
>
> You have hit this:
>
> Mar 3 16:49:16 breadnut2 VirtualDomain[20709]: INFO: Virtual domain vg.test1 currently has no state, retrying.
> Mar 3 16:49:16 breadnut2 lrmd: [20694]: WARN: p-vd_vg.test1:monitor process (PID 20709) timed out (try 1). Killing with signal SIGTERM (15).
> Mar 3 16:49:16 breadnut2 lrmd: [20694]: WARN: operation monitor[5] on ocf::VirtualDomain::p-vd_vg.test1 for client 20697, its parameters: crm_feature_set=[3.0.5] config=[/etc/libvirt/qemu/vg.test1.xml] CRM_meta_timeout=[20000] migration_transport=[tcp] : pid [20709] timed out
> Mar 3 16:49:16 breadnut2 crmd: [20697]: ERROR: process_lrm_event: LRM operation p-vd_vg.test1_monitor_0 (5) Timed Out (timeout=20000ms)
>
>
> When a cluster node comes up, it is directed to probe every
> configured resource on that node, to find out whether the resource
> is already running there. These probes do not depend on ordering
> constraints; the check is mandatory.
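>
> (These probes show up as the resource's "monitor_0" operation - the
> p-vd_vg.test1_monitor_0 in the log above is exactly that - and
> "crm_mon -o" lists the recorded operations per node.)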
>
> At that moment libvirtd is not running yet, so the VirtualDomain RA
> is unable to connect to it and check whether your VM is running.
> The probe therefore times out.
>
> A timed-out monitor action is treated as an "unknown error" for the
> resource. The pengine cannot be sure that your resource is not
> running, so it assumes that it is, stops the resource everywhere,
> and then starts it again to recover.
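>
> For reference, the status check in the VirtualDomain RA boils down
> to asking libvirtd for the domain state, roughly like this (a
> simplified sketch, not the literal RA code):
>
>     # With libvirtd down, virsh cannot connect, so the probe hangs
>     # until lrmd kills it at the 20s operation timeout seen above.
>     virsh domstate vg.test1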
>
> This is what you are seeing. How to work around it is a different
> story; frankly, I don't see a decent way.
>
> The VirtualDomain RA really cannot tell whether the VM is running
> while it cannot connect to libvirtd. I'm not entirely sure, but
> your log suggests that libvirtd will not be started until the
> VirtualDomain monitor returns.
>
> I'd suggest starting libvirtd before corosync, from the init
> scripts, and seeing if that helps.
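>
> On Debian, something along these lines should do it (the sequence
> numbers are purely illustrative, and the libvirt init script may be
> called libvirt-bin rather than libvirtd on your system):
>
>     # start libvirtd earlier than corosync at boot
>     update-rc.d libvirt-bin defaults 19
>     update-rc.d corosync defaults 21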
Right.
> May anyone propose a cleaner solution?
No. The RA clearly states that libvirtd is required. The
corosync/heartbeat init scripts should have it as Should-Start.
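For example, an LSB header along these lines in the corosync init
script (only the Should-Start entry matters here; the other fields
are illustrative):

    ### BEGIN INIT INFO
    # Provides:          corosync
    # Required-Start:    $network $remote_fs $syslog
    # Required-Stop:     $network $remote_fs $syslog
    # Should-Start:      libvirtd
    # Should-Stop:       libvirtd
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: Corosync cluster engine
    ### END INIT INFO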
Thanks,
Dejan
>
> --
> Pavel Levshin
>
>
> On 03.03.2011 9:05, AP wrote:
>> Hi,
>>
>> I'm having deep issues with my cluster setup. Everything works OK
>> until I add a VirtualDomain RA; then things go pear-shaped: the
>> resource seems to ignore its "order" crm config and starts as soon
>> as it can.
>>
>> The crm config is provided below. Basically p-vd_vg.test1 attempts to
>> start despite p-libvirtd not being started and p-drbd_vg.test1 not
>> being master (or slave for that matter - ie it's not configured at all).
>>
>> Eventually p-libvirtd and p-drbd_vg.test1 start, and when
>> p-vd_vg.test1 attempts to start again, pengine on the node where
>> p-vd_vg.test1 is already running complains with:
>>
>> Mar 3 16:49:16 breadnut pengine: [2097]: ERROR: native_create_actions: Resource p-vd_vg.test1 (ocf::VirtualDomain) is active on 2 nodes attempting recovery
>> Mar 3 16:49:16 breadnut pengine: [2097]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>>
>> Then mass slaughter occurs and p-vd_vg.test1 is restarted where it was
>> running previously whilst the other node gets an error for it.
>>
>> Essentially I cannot restart the 2nd node without it breaking the 1st.
>>
>> Now, as I understand it, a lone primitive will run as a single
>> instance on some node - this is just fine by me.
>>
>> colo-vd_vg.test1 indicates that p-vd_vg.test1 should run where ms-drbd_vg.test1
>> is master. ms-drbd_vg.test1 should only be master where clone-libvirtd is
>> started.
>>
>> order-vg.test1 indicates that ms-drbd_vg.test1 should start after
>> clone-lvm_gh has started (successfully). (This used to use a
>> promote action for ms-drbd_vg.test1, but then ms-drbd_vg.test1
>> would only be demoted, not stopped, on shutdown, which caused
>> clone-lvm_gh to error out on stop.)
>>
>> order-vd_vg.test1 indicates p-vd_vg.test1 should only start where
>> ms-drbd_vg.test1 and clone-libvirtd have both successfully started (the
>> order of their starting being irrelevant).
>>
>> cli-standby-p-vd_vg.test1 was put there by my migrating
>> p-vd_vg.test1 around the cluster.
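>>
>> (If that constraint is stale, something like "crm resource
>> unmigrate p-vd_vg.test1" should remove it again.)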
>>
>> This happens with or without fencing, and with fencing configured
>> as below or as just a single primitive with both nodes in the
>> hostlist.
>>
>> Help with this would be awesome and appreciated. I do not know
>> what I am missing here. The config makes sense to me, so I don't
>> even know where to start poking and prodding. I'm flailing.
>>
>> Config and s/w version list is below:
>>
>> OS: Debian Squeeze
>> Kernel: 2.6.37.2
>>
>> PACKAGES:
>>
>> ii cluster-agents 1:1.0.4-0ubuntu1~custom1 The reusable cluster components for Linux HA
>> ii cluster-glue 1.0.7-3ubuntu1~custom1 The reusable cluster components for Linux HA
>> ii corosync 1.3.0-1ubuntu1~custom1 Standards-based cluster framework (daemon and modules)
>> ii libccs3 3.1.0-0ubuntu1~custom1 Red Hat cluster suite - cluster configuration libraries
>> ii libcib1 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - CIB
>> ii libcman3 3.1.0-0ubuntu1~custom1 Red Hat cluster suite - cluster manager libraries
>> ii libcorosync4 1.3.0-1ubuntu1~custom1 Standards-based cluster framework (libraries)
>> ii libcrmcluster1 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - CRM
>> ii libcrmcommon2 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - common CRM
>> ii libfence4 3.1.0-0ubuntu1~custom1 Red Hat cluster suite - fence client library
>> ii liblrm2 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- liblrm2
>> ii libpe-rules2 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - rules for P-Engine
>> ii libpe-status3 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - status for P-Engine
>> ii libpengine3 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - P-Engine
>> ii libpils2 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- libpils2
>> ii libplumb2 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- libplumb2
>> ii libplumbgpl2 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- libplumbgpl2
>> ii libstonith1 1.0.7-3ubuntu1~custom1 Reusable cluster libraries -- libstonith1
>> ii libstonithd1 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - stonith
>> ii libtransitioner1 1.1.5-0ubuntu1~ppa1~custom1 The Pacemaker libraries - transitioner
>> ii pacemaker 1.1.5-0ubuntu1~ppa1~custom1 HA cluster resource manager
>>
>> CONFIG:
>>
>> node breadnut
>> node breadnut2 \
>> attributes standby="off"
>> primitive fencing-bn stonith:meatware \
>> params hostlist="breadnut" \
>> op start interval="0" timeout="60s" \
>> op stop interval="0" timeout="70s" \
>> op monitor interval="10" timeout="60s"
>> primitive fencing-bn2 stonith:meatware \
>> params hostlist="breadnut2" \
>> op start interval="0" timeout="60s" \
>> op stop interval="0" timeout="70s" \
>> op monitor interval="10" timeout="60s"
>> primitive p-drbd_vg.test1 ocf:linbit:drbd \
>> params drbd_resource="vg.test1" \
>> operations $id="ops-drbd_vg.test1" \
>> op start interval="0" timeout="240s" \
>> op stop interval="0" timeout="100s" \
>> op monitor interval="20" role="Master" timeout="20s" \
>> op monitor interval="30" role="Slave" timeout="20s"
>> primitive p-libvirtd ocf:local:libvirtd \
>> meta allow-migrate="off" \
>> op start interval="0" timeout="200s" \
>> op stop interval="0" timeout="100s" \
>> op monitor interval="10" timeout="200s"
>> primitive p-lvm_gh ocf:heartbeat:LVM \
>> params volgrpname="gh" \
>> meta allow-migrate="off" \
>> op start interval="0" timeout="90s" \
>> op stop interval="0" timeout="100s" \
>> op monitor interval="10" timeout="100s"
>> primitive p-vd_vg.test1 ocf:heartbeat:VirtualDomain \
>> params config="/etc/libvirt/qemu/vg.test1.xml" \
>> params migration_transport="tcp" \
>> meta allow-migrate="true" is-managed="true" \
>> op start interval="0" timeout="120s" \
>> op stop interval="0" timeout="120s" \
>> op migrate_to interval="0" timeout="120s" \
>> op migrate_from interval="0" timeout="120s" \
>> op monitor interval="10s" timeout="120s"
>> ms ms-drbd_vg.test1 p-drbd_vg.test1 \
>> meta resource-stickiness="100" notify="true" master-max="2" target-role="Master"
>> clone clone-libvirtd p-libvirtd \
>> meta interleave="true"
>> clone clone-lvm_gh p-lvm_gh \
>> meta interleave="true"
>> location cli-standby-p-vd_vg.test1 p-vd_vg.test1 \
>> rule $id="cli-standby-rule-p-vd_vg.test1" -inf: #uname eq breadnut2
>> location loc-fencing-bn fencing-bn -inf: breadnut
>> location loc-fencing-bn2 fencing-bn2 -inf: breadnut2
>> colocation colo-vd_vg.test1 inf: p-vd_vg.test1:Started ms-drbd_vg.test1:Master clone-libvirtd:Started
>> order order-vd_vg.test1 inf: ( ms-drbd_vg.test1:start clone-libvirtd:start ) p-vd_vg.test1:start
>> order order-vg.test1 inf: clone-lvm_gh:start ms-drbd_vg.test1:start
>> property $id="cib-bootstrap-options" \
>> dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>> cluster-infrastructure="openais" \
>> default-resource-stickiness="1000" \
>> stonith-enabled="true" \
>> expected-quorum-votes="2" \
>> no-quorum-policy="ignore" \
>> last-lrm-refresh="1299128317"
>>
>>
>>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker