[Pacemaker] Problem: monitor timeout causes cluster resource unmanaged and stopped on both nodes.

Wed Dec 16 11:55:10 EST 2009

Hello everybody,

We have a shared disk cluster with pacemaker (1.0.6) and corosync 
(1.1.2). Two nodes, active (herculespre) - passive (jasonpre). MySQL 
service as a resource, inside a group with attribute 
migration-threshold="5" (so that if a resource fail-count is 5 or more, 
it will be moved to the passive node)

Yesterday we launched some mysql benchmarking and found out that a 
resource monitor timeout brought us to the resource unmanaged and 
stopped on both nodes!!

 > crm_mon
...
Resource Group: mysql-horde-group
     mysql-horde-nfs    (ocf::heartbeat:Filesystem):    Started herculespre
     mysql-horde-service        (lsb:mysql-horde):      Started 
herculespre (unmanaged) FAILED <------ !!!!

I can not understand why this situation came out. Maybe we have not 
properly configured the cluster? Or could it be another problem?

Analyzing the problem step by step:

1. The resource monitor timed out, and the fail-count got updated +1:
Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: 
mysql-horde-service:monitor process (PID 12229) timed out (try 1). 
Killing with signal SIGTERM (15).
Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: operation monitor[37] on 
lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: 
CRM_meta_interval=[10000] CRM_meta_timeout=[5000] 
crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [12229] timed out
Dec 15 20:12:54 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM 
operation mysql-horde-service_monitor_10000 (37) Timed Out (timeout=5000ms)
Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_local_callback: 
Expanded fail-count-mysql-horde-service=value++ to 1

That's ok. It could be that our mysql service was busy performing all 
the test operations, and took more that 5s to answer.

2. The CRM decided to stop the service.
Dec 15 20:12:55 herculespre crmd: [8562]: info: do_lrm_rsc_op: 
Performing key=4:1379:0:ae99a943-f4b7-4979-b0c9-09c7f9dd0f9f 
op=mysql-horde-service_stop_0 )
Dec 15 20:12:55 herculespre lrmd: [8559]: info: 
rsc:mysql-horde-service:38: stop

3. The MySQL service received the order and shutted down properly. From 
mysql.log:
091215 20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: 
Normal shutdown
...
091215 20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: 
Shutdown complete

4. Here comes the problem: the cluster did not received the confirmation 
that the service was properly shutted down:
Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop 
process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15).
Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: operation stop[38] on 
lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: 
CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] : pid [12270] timed out
Dec 15 20:13:17 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM 
operation mysql-horde-service_stop_0 (38) Timed Out (timeout=20000ms)

What is happening here?? As it appears in the log, the timeout is 
suposed to be 20s (20000ms), and the service jsut took 3s to shutdown.
Is it a problem with lrmd?

5. As a stop-action has failed, the fail-count got updated to INFINITY
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: 
Sent update 43: fail-count-mysql-horde-service=INFINITY

The final output is that the resource is unmanaged and stopped on both 
nodes!!!

We have not configured any stonith resource, as the servers are on 
Vmware Virtual Center and we have not found a working solution. We could 
not stopped the crmd process properly, and we had to kill it.
When we rebooted the node, this situation got solved, so we reckon that 
with a stonith resource this could have been solved automatically.
But we still do not understand what happened here and why the resource 
got "unmanaged".
Could this be a bug?

I attached the full logs and the configuration, as I did not want to 
spam the list.

We will highly appreciate any help understanding what could have 
happened wrong.

Thank you very much for your help!!

Regards,

---
Oscar Remírez de Ganuza
Servicios Informáticos
Universidad de Navarra
Ed. de Derecho, Campus Universitario
31080 Pamplona (Navarra), Spain
tfno: +34 948 425600 Ext. 3130
http://www.unav.es/SI
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: configure_show-20091215
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment.bat>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: crm_mon-20091215
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment-0001.bat>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages-20091215
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment-0002.bat>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: mysql.log-20091215
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment-0003.bat>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 4422 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment.bin>