[Pacemaker] Problem: monitor timeout causes cluster resource unmanaged and stopped on both nodes.
Oscar Remírez de Ganuza Satrústegui
oscarrdg at unav.es
Wed Dec 16 11:55:10 EST 2009
Hello everybody,
We have a shared disk cluster with pacemaker (1.0.6) and corosync
(1.1.2). Two nodes, active (herculespre) - passive (jasonpre). MySQL
service as a resource, inside a group with attribute
migration-threshold="5" (so that if a resource fail-count is 5 or more,
it will be moved to the passive node)
Yesterday we launched some mysql benchmarking and found out that a
resource monitor timeout brought us to the resource unmanaged and
stopped on both nodes!!
> crm_mon
...
Resource Group: mysql-horde-group
mysql-horde-nfs (ocf::heartbeat:Filesystem): Started herculespre
mysql-horde-service (lsb:mysql-horde): Started
herculespre (unmanaged) FAILED <------ !!!!
I can not understand why this situation came out. Maybe we have not
properly configured the cluster? Or could it be another problem?
Analyzing the problem step by step:
1. The resource monitor timed out, and the fail-count got updated +1:
Dec 15 20:12:54 herculespre lrmd: [8559]: WARN:
mysql-horde-service:monitor process (PID 12229) timed out (try 1).
Killing with signal SIGTERM (15).
Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: operation monitor[37] on
lsb::mysql-horde::mysql-horde-service for client 8562, its parameters:
CRM_meta_interval=[10000] CRM_meta_timeout=[5000]
crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [12229] timed out
Dec 15 20:12:54 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM
operation mysql-horde-service_monitor_10000 (37) Timed Out (timeout=5000ms)
Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_local_callback:
Expanded fail-count-mysql-horde-service=value++ to 1
That's ok. It could be that our mysql service was busy performing all
the test operations, and took more that 5s to answer.
2. The CRM decided to stop the service.
Dec 15 20:12:55 herculespre crmd: [8562]: info: do_lrm_rsc_op:
Performing key=4:1379:0:ae99a943-f4b7-4979-b0c9-09c7f9dd0f9f
op=mysql-horde-service_stop_0 )
Dec 15 20:12:55 herculespre lrmd: [8559]: info:
rsc:mysql-horde-service:38: stop
3. The MySQL service received the order and shutted down properly. From
mysql.log:
091215 20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld:
Normal shutdown
...
091215 20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld:
Shutdown complete
4. Here comes the problem: the cluster did not received the confirmation
that the service was properly shutted down:
Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop
process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15).
Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: operation stop[38] on
lsb::mysql-horde::mysql-horde-service for client 8562, its parameters:
CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] : pid [12270] timed out
Dec 15 20:13:17 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM
operation mysql-horde-service_stop_0 (38) Timed Out (timeout=20000ms)
What is happening here?? As it appears in the log, the timeout is
suposed to be 20s (20000ms), and the service jsut took 3s to shutdown.
Is it a problem with lrmd?
5. As a stop-action has failed, the fail-count got updated to INFINITY
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update:
Sent update 43: fail-count-mysql-horde-service=INFINITY
The final output is that the resource is unmanaged and stopped on both
nodes!!!
We have not configured any stonith resource, as the servers are on
Vmware Virtual Center and we have not found a working solution. We could
not stopped the crmd process properly, and we had to kill it.
When we rebooted the node, this situation got solved, so we reckon that
with a stonith resource this could have been solved automatically.
But we still do not understand what happened here and why the resource
got "unmanaged".
Could this be a bug?
I attached the full logs and the configuration, as I did not want to
spam the list.
We will highly appreciate any help understanding what could have
happened wrong.
Thank you very much for your help!!
Regards,
---
Oscar Remírez de Ganuza
Servicios Informáticos
Universidad de Navarra
Ed. de Derecho, Campus Universitario
31080 Pamplona (Navarra), Spain
tfno: +34 948 425600 Ext. 3130
http://www.unav.es/SI
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: configure_show-20091215
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment.bat>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: crm_mon-20091215
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment-0001.bat>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages-20091215
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment-0002.bat>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: mysql.log-20091215
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment-0003.bat>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 4422 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20091216/810078a8/attachment.bin>
More information about the Pacemaker
mailing list