[Pacemaker] Problem: monitor timeout causes cluster resource unmanaged and stopped on both nodes.

Thu Dec 17 09:09:01 EST 2009

Hi again,

I have been digging through the documentation and thought I should 
answer my own questions and share the answers with the list; maybe 
someone else will find them useful too.

Oscar Remírez de Ganuza Satrústegui wrote:
>>>> What is happening here? According to the log, the timeout is 
>>>> supposed to be 20s (20000ms), and the service took just 3s to 
>>>> shut down.
>>>> Is it a problem with lrmd?
>>>>       
>>> Looks like it.
>>>     
>>
>> It could be that you were unlucky here and that the database
>> really took around 20 seconds to shut down. If so, then
>>   
> Oh, thanks! You are right!
> The command to shut down the mysql resource was sent at 20:12:55, but 
> the mysql service did not start shutting down until 20:13:14, 
> finishing at 20:13:17 (22 seconds, which exceeds the 20 s timeout).
>
> How is it possible to change the timeout for start or stop operations?
Have a look here:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-operation-defaults.html#id525162
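From my reading of that page, the timeout can be set per operation on 
the resource, or as a default for all operations. A rough sketch in crm 
shell syntax follows. It is untested: the lsb:mysql agent, the intervals 
and the 120s/60s values are only assumptions for illustration, not our 
real configuration, and the # lines are my annotations.

```
# per-operation timeouts on one resource (agent and values are assumptions)
primitive mysql-horde-service lsb:mysql \
        op monitor interval="30s" timeout="30s" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"

# default timeout for operations that do not set their own (assumed value)
op_defaults timeout="60s"
```

The stop timeout should comfortably exceed the slowest shutdown seen in 
the logs, which in our case was 22 seconds.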
>
>> please increase your timeouts. You also mentioned somewhere that
>> 5s is set for a monitor timeout, that's way too low for any kind
>> of resource. There's a chapter on applications in HA environments
>> in a paper I recently presented (http://tinyurl.com/yg7u4bd).
>>   
> We had configured very low timeouts for the monitors too. When I tried 
> to change them today, even the crm shell warned me:
> crm(live)# configure edit
> WARNING: mysql-horde-nfs: timeout 10s for monitor_0 is smaller than the advised 40
> WARNING: mysql-horde-service: timeout 10s for monitor_0 is smaller than the advised 15
> WARNING: pingd: timeout 10s for monitor_0 is smaller than the advised 20
>
> I have read your paper and understand the importance of tuning 
> the timeout values correctly, so as not to cause false positives 
> and unnecessary downtime.
>
> Just two last questions:
> Is it 'normal' to set a resource as "unmanaged" just because the stop 
> operation timed out once?
As found here 
(http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html):
"Stop failures are slightly different and crucial. If a resource fails 
to stop and STONITH is enabled, then the cluster will fence the node in 
order to be able to start the resource elsewhere. If STONITH is not 
enabled, then the cluster has no way to continue and will not try to 
start the resource elsewhere, but will try to stop it again after the 
failure timeout."
> Is it possible to configure the cluster to try more than once to stop 
> a resource? (as it is possible to do for the start operation with the 
> cluster property start-failure-is-fatal="false")
I will configure the failure-timeout attribute and run some tests.
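A minimal sketch of what I intend to test, in crm shell syntax. The 60s 
value and the lsb:mysql agent are arbitrary assumptions, the # lines are 
my annotations, and I have not yet verified how this behaves after a 
failed stop.

```
# expire recorded failures of this resource after 60s (assumed value)
primitive mysql-horde-service lsb:mysql \
        meta failure-timeout="60s"

# or set it as a default for all resources (assumed value)
rsc_defaults failure-timeout="60s"
```

As far as I understand, expired failures are only acted on when the 
cluster re-evaluates its state, so the cluster-recheck-interval property 
may also influence how soon the stop is retried.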

Thank you very much for your time building this software, and helping us 
to use it!

Regards,

---
Oscar Remírez de Ganuza
Servicios Informáticos
Universidad de Navarra
Ed. de Derecho, Campus Universitario
31080 Pamplona (Navarra), Spain
tfno: +34 948 425600 Ext. 3130
http://www.unav.es/SI
