[Pacemaker] Problem: monitor timeout causes cluster resource unmanaged and stopped on both nodes.

Thu Dec 17 09:48:01 UTC 2009

Hi,

On Thu, Dec 17, 2009 at 09:18:20AM +0100, Andrew Beekhof wrote:
> On Wed, Dec 16, 2009 at 5:55 PM, Oscar Remírez de Ganuza Satrústegui
> <oscarrdg at unav.es> wrote:
> 
> [snip]
> 
> > 2. The CRM decided to stop the service.
> > Dec 15 20:12:55 herculespre crmd: [8562]: info: do_lrm_rsc_op: Performing
> > key=4:1379:0:ae99a943-f4b7-4979-b0c9-09c7f9dd0f9f
> > op=mysql-horde-service_stop_0 )
> > Dec 15 20:12:55 herculespre lrmd: [8559]: info: rsc:mysql-horde-service:38:
> > stop
> >
> > 3. The MySQL service received the order and shut down properly. From
> > mysql.log:
> > 091215 20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Normal
> > shutdown
> > ...
> > 091215 20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Shutdown
> > complete
> >
> > 4. Here comes the problem: the cluster did not receive the confirmation
> > that the service was properly shut down:
> > Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop
> > process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15).
> > Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: operation stop[38] on
> > lsb::mysql-horde::mysql-horde-service for client 8562, its parameters:
> > CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] : pid [12270] timed out
> > Dec 15 20:13:17 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM
> > operation mysql-horde-service_stop_0 (38) Timed Out (timeout=20000ms)
> >
> > What is happening here?? As it appears in the log, the timeout is supposed to
> > be 20s (20000ms), and the service just took 3s to shut down.
> > Is it a problem with lrmd?
> 
> Looks like it.

Don't think so. Here are the logs again:

Dec 15 20:12:55 herculespre lrmd: [8559]: info: rsc:mysql-horde-service:38: stop

lrmd invokes the RA to stop mysql. Whatever happened between this
time and the following, it took 19 seconds before mysqld even
started to shut down:

20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Normal shutdown
20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Shutdown complete
Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop
process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15).

It could be that you were unlucky here and that the database
really took around 20 seconds to shut down. If so, please
increase your timeouts. You also mentioned somewhere that 5s is
set for a monitor timeout; that's way too low for any kind of
resource. There's a chapter on applications in HA environments
in a paper I recently presented (http://tinyurl.com/yg7u4bd).
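
As a rough check of the timestamps quoted above, here is a minimal
Python sketch. The function and constant names are made up for
illustration; the times and the 20000ms timeout are copied from the
logs. Log timestamps only have one-second resolution, so treat the
results as approximate.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Timestamps copied from the log excerpts in this thread.
STOP_INVOKED = "20:12:55"      # lrmd: rsc:mysql-horde-service:38: stop
SHUTDOWN_STARTED = "20:13:14"  # mysqld: Normal shutdown
SHUTDOWN_DONE = "20:13:17"     # mysqld: Shutdown complete, lrmd: timed out
TIMEOUT_S = 20                 # CRM_meta_timeout=[20000]

# Time spent before mysqld even began shutting down.
delay_before_shutdown = elapsed_seconds(STOP_INVOKED, SHUTDOWN_STARTED)
# Time mysqld itself needed once it started shutting down.
mysqld_shutdown_time = elapsed_seconds(SHUTDOWN_STARTED, SHUTDOWN_DONE)
# Total time of the stop operation as lrmd sees it.
total_stop_time = elapsed_seconds(STOP_INVOKED, SHUTDOWN_DONE)
timed_out = total_stop_time > TIMEOUT_S

print(f"delay before shutdown started: {delay_before_shutdown}s")
print(f"mysqld shutdown itself:        {mysqld_shutdown_time}s")
print(f"total stop operation:          {total_stop_time}s")
print(f"exceeds {TIMEOUT_S}s timeout:  {timed_out}")
```

So mysqld itself only needed about 3 seconds, but the stop operation
as a whole ran longer than the 20s timeout, counted from the moment
lrmd invoked the RA.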

Thanks,

Dejan

> Given the time of year, it would probably be a good idea to create a
> bugzilla entry so that this doesn't get lost.