[Pacemaker] Problem: monitor timeout causes cluster resource unmanaged and stopped on both nodes.

Thu Dec 17 13:46:38 UTC 2009

Hi,

Dejan Muhamedagic wrote:
> Hi,
>
> On Thu, Dec 17, 2009 at 09:18:20AM +0100, Andrew Beekhof wrote:
>   
>> On Wed, Dec 16, 2009 at 5:55 PM, Oscar Remírez de Ganuza Satrústegui
>> <oscarrdg at unav.es> wrote:
>>
>> [snip]
>>     
>>> What is happening here? As it appears in the log, the timeout is supposed to
>>> be 20s (20000ms), and the service just took 3s to shut down.
>>> Is it a problem with lrmd?
>>>       
>> Looks like it.
>>     
>
> Don't think so. Here's the logs again:
>
> Dec 15 20:12:55 herculespre lrmd: [8559]: info: rsc:mysql-horde-service:38: stop
>
> lrmd invokes the RA to stop mysql. Whatever happened, it happened between
> this time and the following:
>
> 20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Normal shutdown
> 20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Shutdown
> Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop
> process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15).
>
> It could be that you were unlucky here and that the database
> really took around 20 seconds to shut down. If it is so, then
>   
Oh, thanks! You are right!
The command to shut down the mysql resource was sent at 20:12:55, but the 
mysql service did not start shutting down until 20:13:14 and finished at 
20:13:17. That is 22 seconds in total, which exceeds the 20 s timeout.
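
As a quick check of that arithmetic, here is a minimal Python sketch that uses only the three timestamps from the log excerpt above and the 20 s stop timeout. The variable names are my own and purely illustrative; nothing here comes from Pacemaker or lrmd.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps copied from the log excerpt (all on Dec 15).
stop_requested = datetime.strptime("20:12:55", FMT)     # lrmd invokes the RA stop
shutdown_started = datetime.strptime("20:13:14", FMT)   # mysqld: Normal shutdown
shutdown_finished = datetime.strptime("20:13:17", FMT)  # mysqld: Shutdown

timeout_s = 20  # configured stop timeout (20000 ms)

# Time spent before mysqld even began shutting down.
wait_s = (shutdown_started - stop_requested).total_seconds()
# Time mysqld itself needed to shut down.
shutdown_s = (shutdown_finished - shutdown_started).total_seconds()
# Total time as seen by lrmd, which is what the timeout is measured against.
total_s = (shutdown_finished - stop_requested).total_seconds()

print(f"waited {wait_s:.0f}s, shutdown took {shutdown_s:.0f}s, total {total_s:.0f}s")
print("timed out" if total_s > timeout_s else "within timeout")
```

So the 3 s I saw was only the last phase; most of the time was spent before mysqld began its shutdown.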

How can I change the timeout for start or stop operations?

> please increase your timeouts. You also mentioned somewhere that
> 5s is set for a monitor timeout, that's way too low for any kind
> of resource. There's a chapter on applications in HA environments
> in a paper I recently presented (http://tinyurl.com/yg7u4bd).
>   
We had configured very low timeouts for the monitors too. When I tried 
to change them today, even the crm shell warned me:
crm(live)# configure edit
WARNING: mysql-horde-nfs: timeout 10s for monitor_0 is smaller than the advised 40
WARNING: mysql-horde-service: timeout 10s for monitor_0 is smaller than the advised 15
WARNING: pingd: timeout 10s for monitor_0 is smaller than the advised 20

I have read your paper and understand the importance of correctly 
tuning the timeout values, so as not to cause false positives and 
unavailability.

Just two last questions:
Is it 'normal' for a resource to be set as "unmanaged" just because the 
stop operation timed out once?
Is it possible to configure the cluster to try more than once to stop a 
resource (as is possible for the start operation with the 
cluster property start-failure-is-fatal="false")?

Thank you very much for your help!
I really appreciate it!

Regards,

---
Oscar Remírez de Ganuza
Servicios Informáticos
Universidad de Navarra
Ed. de Derecho, Campus Universitario
31080 Pamplona (Navarra), Spain
tfno: +34 948 425600 Ext. 3130
http://www.unav.es/SI 
