[Pacemaker] 1st monitor is too fast after the start

Wed Oct 13 08:50:42 UTC 2010

Pavlos Parissis wrote:
> On 13 October 2010 09:48, Dan Frincu <dfrincu at streamwide.ro> wrote:
>   
>> Hi,
>>
>> I've noticed the same type of behavior, although in a different context. My
>> setup includes 3 drbd devices and a group of resources, all of which have to
>> run on the same node and move together to other nodes. My issue was with the
>> first resource that required access to a drbd device: the
>> ocf:heartbeat:Filesystem RA was trying to do a mount and failing.
>>
>> The reason was that it was trying to mount the drbd device before the drbd
>> device had finished its transition to primary state. Like you, I introduced
>> a start-delay, but on the start action. This proved to be of no use, as the
>> behavior persisted even with an increased start-delay. However, it only
>> happened when performing a fail-back operation; during fail-over everything
>> was ok.
>>
>> The fix I made was to remove any start-delay and to add colocation
>> constraints between the group and all ms_drbd resources. Before that I only
>> had one colocation constraint, for the drbd device being promoted last.
>>
>> I hope this helps.
>>
>>     
>
> I am glad that somebody else experienced the same issue :)
>
> In my mail I was talking about the monitor action which was failing,
> but the behavior you described also happened on my system with the same
> setup, drbd and fs resource. It happened on the application resource
> as well: the start was too fast and the FS was not mounted (yet) when
> the start action fired for the application resource. A delay in the start
> function of the application's resource agent fixed my issue.
>
> In my setup I have all the necessary constraints to avoid this, or at
> least I believe so :-)
>
> Cheers,
> Pavlos
>   
From what I see, you have a dual primary setup with failover to the
third node. Basically, if you have one drbd resource for which you have
both ordering and colocation, I don't think you need to "improve" it;
if it ain't broke, don't fix it :)
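
For reference, here is roughly what the group-level variant of the fix I
described would look like with your resource names. This is only a sketch,
not a tested config: the colocation ID is made up, and the order constraint
is the one you already have.

```
# sketch only; "pbx_service_01-on-drbd_01" is a hypothetical constraint ID
colocation pbx_service_01-on-drbd_01 inf: pbx_service_01 ms-drbd_01:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
```

With a single drbd device per group, your existing fs_01 colocation should
already have much the same effect, which is why I wouldn't change it.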

Regards,

Dan
>
> [root at node-01 sysconfig]# crm configure show
> node $id="059313ce-c6aa-4bd5-a4fb-4b781de6d98f" node-03
> node $id="d791b1f5-9522-4c84-a66f-cd3d4e476b38" node-02
> node $id="e388e797-21f4-4bbe-a588-93d12964b4d7" node-01
> primitive drbd_01 ocf:linbit:drbd \
>         params drbd_resource="drbd_pbx_service_1" \
>         op monitor interval="30s" \
>         op start interval="0" timeout="240s" \
>         op stop interval="0" timeout="120s"
> primitive drbd_02 ocf:linbit:drbd \
>         params drbd_resource="drbd_pbx_service_2" \
>         op monitor interval="30s" \
>         op start interval="0" timeout="240s" \
>         op stop interval="0" timeout="120s"
> primitive fs_01 ocf:heartbeat:Filesystem \
>         params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
>         meta migration-threshold="3" failure-timeout="60" \
>         op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s"
> primitive fs_02 ocf:heartbeat:Filesystem \
>         params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
>         meta migration-threshold="3" failure-timeout="60" \
>         op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s"
> primitive ip_01 ocf:heartbeat:IPaddr2 \
>         params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
>         meta failure-timeout="120" migration-threshold="3" \
>         op monitor interval="5s"
> primitive ip_02 ocf:heartbeat:IPaddr2 \
>         meta failure-timeout="120" migration-threshold="3" \
>         params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
>         op monitor interval="5s"
> primitive pbx_01 lsb:znd-pbx_01 \
>         meta migration-threshold="3" failure-timeout="60" target-role="Started" \
>         op monitor interval="20s" timeout="20s" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s"
> primitive pbx_02 lsb:znd-pbx_02 \
>         meta migration-threshold="3" failure-timeout="60" \
>         op monitor interval="20s" timeout="20s" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s"
> primitive sshd_01 lsb:znd-sshd-pbx_01 \
>         meta target-role="Started" is-managed="true" \
>         op monitor on-fail="stop" interval="10m" \
>         op start interval="0" timeout="60s" on-fail="stop" \
>         op stop interval="0" timeout="60s" on-fail="stop"
> primitive sshd_02 lsb:znd-sshd-pbx_02 \
>         meta target-role="Started" \
>         op monitor on-fail="stop" interval="10m" \
>         op start interval="0" timeout="60s" on-fail="stop" \
>         op stop interval="0" timeout="60s" on-fail="stop"
> group pbx_service_01 ip_01 fs_01 pbx_01 sshd_01 \
>         meta target-role="Started"
> group pbx_service_02 ip_02 fs_02 pbx_02 sshd_02
> ms ms-drbd_01 drbd_01 \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
> ms ms-drbd_02 drbd_02 \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
> location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
> location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
> location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
> location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
> location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
> location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
> location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
> location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
> colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
> colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
> order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
> order pbx_service_02-after-drbd_02 inf: ms-drbd_02:promote pbx_service_02:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.3-9c2342c0378140df9bed7d192f2b9ed157908007" \
>         cluster-infrastructure="Heartbeat" \
>         symmetric-cluster="false" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1286895296"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="1000"
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>   

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
