[Pacemaker] 1st monitor is too fast after the start

Dan Frincu dfrincu at streamwide.ro
Wed Oct 13 07:48:08 UTC 2010


Hi,

I've noticed the same type of behavior, although in a different context.
My setup includes 3 drbd devices and a group of resources, all of which
have to run on the same node and move together to other nodes. My issue
was with the first resource that required access to a drbd device: the
ocf:heartbeat:Filesystem RA was trying to do a mount and failing.

The reason was that it was trying to mount the drbd device before the
device had finished switching to primary state. Like you, I introduced a
start-delay, but on the start action. This proved to be of no use: the
behavior persisted, even with an increased start-delay. However, it only
happened during a fail-back operation; during fail-over everything was
OK.

The fix I made was to remove any start-delay and to add colocation
constraints between the group and all ms_drbd resources. Before that I
only had one colocation constraint, for the drbd device being promoted
last.
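
As a rough sketch, one colocation constraint per ms_drbd resource could
be generated like this. The resource names (grp_services, ms_drbd0 to
ms_drbd2) and the constraint IDs are placeholders, and the output assumes
crm shell syntax.

```shell
#!/bin/sh
# Print one crm colocation constraint per DRBD master/slave resource,
# tying the group to the node where that resource is Master.
# Every resource name passed in here is a placeholder.
emit_colocations() {
    group=$1
    shift
    for ms in "$@"; do
        printf 'colocation col_%s_on_%s inf: %s %s:Master\n' \
            "$group" "$ms" "$group" "$ms"
    done
}

# Print the constraints so they can be reviewed before use.
emit_colocations grp_services ms_drbd0 ms_drbd1 ms_drbd2
```

The printed lines can be reviewed and then added to the configuration,
for example by pasting them in via "crm configure edit".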

I hope this helps.

Regards,

Dan

Pavlos Parissis wrote:
> Hi,
>
> I noticed a race condition while I was integrating an application with
> Pacemaker and thought I would share it with you.
>
> The init script of the application is LSB-compliant and passes the
> tests mentioned in the Pacemaker documentation. Moreover, the init
> script uses the supplied functions from the system[1] for starting,
> stopping and checking the application.
>
> I observed a few times that the monitor action was failing after the
> startup of the cluster or the movement of the resource group.
> Because it did not always happen and a manual start/status always
> worked, it was quite tricky to find the root cause of the failure.
> After a few hours of troubleshooting, I found out that the 1st monitor
> action after the start action was executed too fast for the
> application to create the pid file. As a result, the monitor action
> was receiving an error.
>
> I know it sounds a bit strange but it happened on my systems. The fact
> that my systems are basically vmware images on a laptop could be
> related to the issue.
>
> Nevertheless, I would like to ask whether you are considering
> implementing an "init_wait" on the 1st monitor action. It could be
> useful.
>
> To solve my issue I put a sleep after the start of the application in
> the init script. This gives enough time for the application to create
> its pid file and the 1st monitor doesn't fail.
>
>
> Cheers,
> Pavlos
>
>
> [1] CentOS 5.4
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>   
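
Regarding the sleep in the init script mentioned in the quoted message:
a fixed sleep could be replaced by a bounded wait for the pid file, so
that start only returns once the file exists. A minimal sketch, assuming
a POSIX shell; the function name wait_for_pidfile, the pid file path in
the usage note and the 10-second default are hypothetical.

```shell
#!/bin/sh
# wait_for_pidfile PIDFILE [TIMEOUT_SECONDS]
# Poll once per second until PIDFILE exists and is non-empty.
# Returns 0 once the file is there, 1 if the timeout expires first.
wait_for_pidfile() {
    pidfile=$1
    timeout=${2:-10}   # hypothetical default, tune per application
    waited=0
    while [ ! -s "$pidfile" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In the start function of the init script this could be called right
after launching the application, e.g.
"wait_for_pidfile /var/run/myapp.pid 10 || exit 1", so that the 1st
monitor action only runs after the pid file has been written.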

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
