[Pacemaker] Pacemaker cluster took almost 2 hours to migrate
Sergey A. Tachenov
stachenov at runbox.com
Mon Mar 24 09:23:46 UTC 2014
Hi,
We have a 2-node Pacemaker (1.1.7 on Fedora Core 16) cluster that runs a
PostgreSQL database on a DRBD drive, Apache, Tomcat and a few custom
services. The load is pretty low (a few HTTP clients, a few database
clients, infrequent database updates - every hour or so). It worked
perfectly for more than a year, surviving three hardware failures with
no problems.
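For reference, the resource stack looks roughly like this (a simplified sketch
in crm syntax, reconstructed from the parameters you will see in the logs
below; providers, groups and constraints here are approximations, not a
verbatim dump of our CIB):

    # Parameter values are the ones visible in the logs; everything else is
    # approximate.
    primitive drbd_pgdrive ocf:linbit:drbd \
            params drbd_resource="pgdrive" \
            op monitor interval="30s" role="Master" timeout="60s"
    ms ms_drbd_pgdrive drbd_pgdrive \
            meta master-max="1" clone-max="2" notify="true"
    primitive postgresql ocf:heartbeat:pgsql \
            params pgdata="/hd2/pgsql" \
            op monitor interval="30s" timeout="30s" \
            op stop interval="0" timeout="60s"
    primitive DBIP ocf:heartbeat:IPaddr2 \
            params ip="xxx.xxx.xxx.185" cidr_netmask="24" \
            op monitor interval="10s" timeout="20s"
    # apache, tomcat, ping and the custom ocf:heartbeat:anything services
    # (PlanServer, TcpPlanServer, PlanUpdateClient) run as two-node clones
    colocation pgsql_on_drbd inf: postgresql ms_drbd_pgdrive:Master
    order pgsql_after_drbd inf: ms_drbd_pgdrive:promote postgresql:start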
Now something strange happened. It started with these messages:
Mar 23 07:21:13 srvplan1 lrmd: [1125]: WARN: ping:0:monitor process (PID
10011) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 07:21:13 srvplan1 lrmd: [1125]: info: ping:0:monitor process (PID
-10011) died before killing (try 1)
Mar 23 07:21:17 srvplan1 lrmd: [1125]: WARN: TcpPlanServer:0:monitor
process (PID 10012) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 07:21:17 srvplan1 lrmd: [1125]: info: TcpPlanServer:0:monitor
process (PID -10012) died before killing (try 1)
Mar 23 07:21:17 srvplan1 lrmd: [1125]: WARN: PlanServer:0:monitor
process (PID 10013) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 07:21:17 srvplan1 lrmd: [1125]: info: PlanServer:0:monitor
process (PID -10013) died before killing (try 1)
Mar 23 07:21:17 srvplan1 lrmd: [1125]: WARN: PlanUpdateClient:0:monitor
process (PID 10014) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 07:21:17 srvplan1 lrmd: [1125]: info: PlanUpdateClient:0:monitor
process (PID -10014) died before killing (try 1)
It all looks weird to me, because if the processes had hung, it is very
unlikely that they would all suddenly die before lrmd got around to
killing them. Then there was silence for more than 45 minutes, not even
the usual "monitor" messages. Moreover, at this point the web services
were already unresponsive. Then this:
Mar 23 08:06:38 srvplan1 lrmd: [1125]: info: rsc:postgresql:47: monitor
Mar 23 08:06:38 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[47] on ocf::pgsql::postgresql for client
1128, its parameters: crm_feature_set=[3.0.6] depth=[0]
CRM_meta_name=[monitor] CRM_meta_timeout=[30000]
CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] stayed
in operation list for 2730970 ms (longer than 10000 ms)
Mar 23 08:06:38 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[29] on ocf::tomcat::tomcat:0 for client
1128, its parameters: java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat]
catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[30000]
CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2]
CRM_meta_interval=[30000] script_log=[/home/tmo/log/tomcat.log]
statusurl=[http://127.0.0.1:8080/status/]
CRM_meta_globally_unique=[false] stayed in operation list for 2717870 ms
(longer than 10000 ms)
Mar 23 08:06:38 srvplan1 lrmd: [1125]: info: rsc:DBIP:45: monitor
Mar 23 08:06:38 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[45] on ocf::IPaddr2::DBIP for client 1128,
its parameters: cidr_netmask=[24] crm_feature_set=[3.0.6]
CRM_meta_name=[monitor] CRM_meta_interval=[10000]
CRM_meta_timeout=[20000] ip=[xxx.xxx.xxx.185] stayed in operation list
for 2737850 ms (longer than 10000 ms)
Mar 23 08:06:39 srvplan1 lrmd: [1125]: info: rsc:drbd_pgdrive:0:42:
monitor
Mar 23 08:06:39 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[42] on ocf::drbd::drbd_pgdrive:0 for client
1128, its parameters: drbd_resource=[pgdrive] CRM_meta_role=[Master]
CRM_meta_notify_stop_resource=[ ]
CRM_meta_notify_start_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_inactive_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_master_uname=[srvplan2 ] CRM_meta_timeout=[60000]
CRM_meta_name=[monitor] CRM_meta_notify_demote_resource=[ ]
CRM_meta_notify_promote_uname=[ ] crm_feature_set=[3.0.6]
CRM_meta_notify=[true] CRM_meta_notify_start_uname=[srvplan1 stayed in
operation list for 2738840 ms (longer than 10000 ms)
Mar 23 08:06:40 srvplan1 lrmd: [1125]: info: rsc:apache:0:26: monitor
Mar 23 08:06:40 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[26] on ocf::apache::apache:0 for client
1128, its parameters: testregex=[</html>] CRM_meta_timeout=[30000]
CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
configfile=[/etc/httpd/conf/httpd.conf] CRM_meta_clone_node_max=[1]
CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[10000]
CRM_meta_globally_unique=[false] stayed in operation list for 2739830 ms
(longer than 10000 ms)
Mar 23 08:06:47 srvplan1 lrmd: [1125]: info: cancel_op: operation
monitor[29] on ocf::tomcat::tomcat:0 for client 1128, its parameters:
java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat]
catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[30000]
CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2]
CRM_meta_interval=[30000] script_log=[/home/tmo/log/tomcat.log]
statusurl=[http://127.0.0.1:8080/status/]
CRM_meta_globally_unique=[false] cancelled
Those 2738840 ms (about 45.6 minutes) match the gap between these
messages and the previous ones. Then the situation evolved into this:
Mar 23 08:06:47 srvplan1 lrmd: [1125]: info: rsc:tomcat:0:58: stop
Mar 23 08:06:47 srvplan1 lrmd: [1125]: info: rsc:tomcat:0:59: start
Mar 23 08:06:48 srvplan1 lrmd: [1125]: info: rsc:TcpPlanServer:0:22:
monitor
Mar 23 08:06:48 srvplan1 lrmd: [1125]: info: rsc:PlanServer:0:32:
monitor
Mar 23 08:06:48 srvplan1 lrmd: [1125]: info: rsc:PlanUpdateClient:0:24:
monitor
Mar 23 08:06:51 srvplan1 lrmd: [1125]: info: rsc:ping:0:25: monitor
Mar 23 08:06:52 srvplan1 lrmd: [1125]: info: rsc:tomcat:0:60: monitor
Mar 23 08:17:53 srvplan1 lrmd: [1125]: WARN: TcpPlanServer:0:monitor
process (PID 20113) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 08:17:53 srvplan1 lrmd: [1125]: info: TcpPlanServer:0:monitor
process (PID -20113) died before killing (try 1)
Mar 23 08:17:57 srvplan1 lrmd: [1125]: WARN: apache:0:monitor process
(PID 20108) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 08:17:59 srvplan1 lrmd: [1125]: WARN: tomcat:0:monitor process
(PID 20109) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 08:18:02 srvplan1 lrmd: [1125]: WARN: apache:0:monitor process
(PID 20108) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 08:18:04 srvplan1 lrmd: [1125]: WARN: tomcat:0:monitor process
(PID 20109) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 08:18:07 srvplan1 lrmd: [1125]: ERROR:
TrackedProcTimeoutFunction: apache:0:monitor process (PID 20108) will
not die!
Mar 23 08:18:09 srvplan1 lrmd: [1125]: ERROR:
TrackedProcTimeoutFunction: tomcat:0:monitor process (PID 20109) will
not die!
Mar 23 08:18:27 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:monitor
process (PID 20107) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 08:18:32 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:monitor
process (PID 20107) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 08:18:37 srvplan1 lrmd: [1125]: ERROR:
TrackedProcTimeoutFunction: drbd_pgdrive:0:monitor process (PID 20107)
will not die!
So the first time they died too soon, and this time they refuse to die?
And that was already an hour into the problem!
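For what it's worth, the usual reason a process ignores even SIGKILL is
that it is stuck in uninterruptible sleep (D state), typically waiting
on disk or DRBD I/O. Next time this happens I plan to check the stuck
PIDs with something like this (purely illustrative; 20108 is just one of
the PIDs from the log above):

    # STAT "D" means uninterruptible sleep; wchan shows what it waits on
    ps -o pid,stat,wchan:32,cmd -p 20108
    # kernel stack of the stuck task, if the kernel exposes it (root only)
    cat /proc/20108/stack
    # hung-task warnings, if the hung-task detector is enabled
    dmesg | grep -i 'blocked for more than'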
Then another 45 minutes passed:
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: operation monitor[42] on
ocf::drbd::drbd_pgdrive:0 for client 1128, its parameters:
drbd_resource=[pgdrive] CRM_meta_role=[Master]
CRM_meta_notify_stop_resource=[ ]
CRM_meta_notify_start_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_inactive_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_master_uname=[srvplan2 ] CRM_meta_timeout=[60000]
CRM_meta_name=[monitor] CRM_meta_notify_demote_resource=[ ]
CRM_meta_notify_promote_uname=[ ] crm_feature_set=[3.0.6]
CRM_meta_notify=[true] CRM_meta_notify_start_uname=[srvplan1 : pid
[20107] timed out
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: operation monitor[26] on
ocf::apache::apache:0 for client 1128, its parameters:
testregex=[</html>] CRM_meta_timeout=[30000] CRM_meta_name=[monitor]
crm_feature_set=[3.0.6] CRM_meta_notify=[false]
configfile=[/etc/httpd/conf/httpd.conf] CRM_meta_clone_node_max=[1]
CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[10000]
CRM_meta_globally_unique=[false] : pid [20108] timed out
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: operation monitor[60] on
ocf::tomcat::tomcat:0 for client 1128, its parameters:
java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat]
catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[30000]
CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2]
CRM_meta_interval=[30000] script_log=[/home/tmo/log/tomcat.log]
statusurl=[http://127.0.0.1:8080/status/]
CRM_meta_globally_unique=[false] : pid [20109] timed out
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[24] on ocf::anything::PlanUpdateClient:0 for
client 1128, its parameters:
logfile=[/home/tmo/planupdateclient/PlanUpdateClient.log.txt]
CRM_meta_timeout=[20000] CRM_meta_name=[monitor]
binfile=[/home/tmo/planupdateclient/PlanUpdateClient]
crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2]
pidfile=[/home/tmo/planupdateclient/PlanUpdateClient.pid]
CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false]
stayed in operation list for 2746640 ms (longer than 10000 ms)
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[47] on ocf::pgsql::postgresql for client
1128, its parameters: crm_feature_set=[3.0.6] depth=[0]
CRM_meta_name=[monitor] CRM_meta_timeout=[30000]
CRM_meta_interval=[30000] CRM_meta_depth=[0] pgdata=[/hd2/pgsql] stayed
in operation list for 2746620 ms (longer than 10000 ms)
Mar 23 09:03:20 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[45] on ocf::IPaddr2::DBIP for client 1128,
its parameters: cidr_netmask=[24] crm_feature_set=[3.0.6]
CRM_meta_name=[monitor] CRM_meta_interval=[10000]
CRM_meta_timeout=[20000] ip=[xxx.xxx.xxx.185] stayed in operation list
for 2746570 ms (longer than 10000 ms)
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: cancel_op: operation
monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters:
crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor]
CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0]
pgdata=[/hd2/pgsql] cancelled
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: flush_op: process for
operation monitor[47] on ocf::pgsql::postgresql for client 1128 still
running, flush delayed
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: on_msg_cancel_op: operation
47 running, cancel pending
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: perform_op:2906: operation
monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters:
crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor]
CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0]
pgdata=[/hd2/pgsql] for rsc is already running.
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing
all ops on resource postgresql by 1000 ms
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: cancel_op: operation
monitor[60] on ocf::tomcat::tomcat:0 for client 1128, its parameters:
java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat]
catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[30000]
CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2]
CRM_meta_interval=[30000] script_log=[/home/tmo/log/tomcat.log]
statusurl=[http://127.0.0.1:8080/status/]
CRM_meta_globally_unique=[false] cancelled
Mar 23 09:03:22 srvplan1 lrmd: [1125]: info: cancel_op: operation
monitor[26] on ocf::apache::apache:0 for client 1128, its parameters:
testregex=[</html>] CRM_meta_timeout=[30000] CRM_meta_name=[monitor]
crm_feature_set=[3.0.6] CRM_meta_notify=[false]
configfile=[/etc/httpd/conf/httpd.conf] CRM_meta_clone_node_max=[1]
CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_interval=[10000]
CRM_meta_globally_unique=[false] cancelled
Mar 23 09:03:23 srvplan1 lrmd: [1125]: info: perform_op:2906: operation
monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters:
crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor]
CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0]
pgdata=[/hd2/pgsql] for rsc is already running.
Mar 23 09:03:23 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing
all ops on resource postgresql by 1000 ms
Mar 23 09:03:24 srvplan1 lrmd: [1125]: info: perform_op:2906: operation
monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters:
crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor]
CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0]
pgdata=[/hd2/pgsql] for rsc is already running.
Mar 23 09:03:24 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing
all ops on resource postgresql by 1000 ms
It goes on for some time, and then at some point Pacemaker decides to
stop PostgreSQL:
Mar 23 09:03:50 srvplan1 lrmd: [1125]: WARN: operation monitor[45] on
ocf::IPaddr2::DBIP for client 1128, its parameters: cidr_netmask=[24]
crm_feature_set=[3.0.6] CRM_meta_name=[monitor]
CRM_meta_interval=[10000] CRM_meta_timeout=[20000] ip=[xxx.xxx.xxx.185]
: pid [20736] timed out
Mar 23 09:03:50 srvplan1 lrmd: [1125]: WARN: operation monitor[24] on
ocf::anything::PlanUpdateClient:0 for client 1128, its parameters:
logfile=[/home/tmo/planupdateclient/PlanUpdateClient.log.txt]
CRM_meta_timeout=[20000] CRM_meta_name=[monitor]
binfile=[/home/tmo/planupdateclient/PlanUpdateClient]
crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2]
pidfile=[/home/tmo/planupdateclient/PlanUpdateClient.pid]
CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false] :
pid [20734] timed out
Mar 23 09:03:50 srvplan1 lrmd: [1125]: WARN: postgresql:monitor process
(PID 20735) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:03:50 srvplan1 lrmd: [1125]: info: perform_op:2906: operation
monitor[47] on ocf::pgsql::postgresql for client 1128, its parameters:
crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor]
CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0]
pgdata=[/hd2/pgsql] for rsc is already running.
Mar 23 09:03:51 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing
all ops on resource postgresql by 1000 ms
Mar 23 09:03:51 srvplan1 lrmd: [1125]: info: rsc:drbd_pgdrive:0:62:
notify
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation notify[62] on ocf::drbd::drbd_pgdrive:0 for client
1128, its parameters: drbd_resource=[pgdrive]
CRM_meta_notify_stop_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_start_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_inactive_resource=[ ]
CRM_meta_notify_master_uname=[srvplan1 ]
CRM_meta_notify_promote_uname=[srvplan1 ] CRM_meta_timeout=[20000]
CRM_meta_name=[start] CRM_meta_notify_demote_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_start_uname=[srvplan1 ] crm_feature_set=[3.0.6]
CRM_meta_clone_node_max=[ stayed in operation list for 28170 ms (longer
than 10000 ms)
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[32] on ocf::anything::PlanServer:0 for
client 1128, its parameters:
logfile=[/home/tmo/PlanServer/PlanServer.log.txt]
CRM_meta_timeout=[20000] CRM_meta_name=[monitor]
binfile=[/home/tmo/PlanServer/PlanServer] crm_feature_set=[3.0.6]
CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0]
CRM_meta_clone_max=[2] pidfile=[/home/tmo/PlanServer/PlanServer.pid]
CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false]
stayed in operation list for 2776430 ms (longer than 10000 ms)
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[25] on ocf::ping::ping:0 for client 1128,
its parameters: CRM_meta_timeout=[20000] multiplier=[1000]
CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] host_list=[xxx.xxx.xxx.2]
CRM_meta_clone_max=[2] CRM_meta_interval=[10000]
CRM_meta_globally_unique=[false] stayed in operation list for 2774360 ms
(longer than 10000 ms)
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: G_SIG_dispatch: Dispatch
function for SIGCHLD was delayed 110 ms (> 100 ms) before being called
(GSource: 0x12bcc00)
Mar 23 09:03:51 srvplan1 lrmd: [1125]: info: G_SIG_dispatch: started at
2234731673 should have started at 2234731662
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: operation monitor[47] on
ocf::pgsql::postgresql for client 1128, its parameters:
crm_feature_set=[3.0.6] depth=[0] CRM_meta_name=[monitor]
CRM_meta_timeout=[30000] CRM_meta_interval=[30000] CRM_meta_depth=[0]
pgdata=[/hd2/pgsql] : pid [20735] timed out
Mar 23 09:03:51 srvplan1 lrmd: [1125]: info: rsc:postgresql:61: stop
Mar 23 09:03:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation stop[61] on ocf::pgsql::postgresql for client 1128,
its parameters: crm_feature_set=[3.0.6] CRM_meta_name=[stop]
CRM_meta_timeout=[60000] pgdata=[/hd2/pgsql] stayed in operation list
for 28170 ms (longer than 10000 ms)
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:notify
process (PID 20776) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: PlanServer:0:monitor
process (PID 20777) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: ping:0:monitor process (PID
20778) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: operation notify[62] on
ocf::drbd::drbd_pgdrive:0 for client 1128, its parameters:
drbd_resource=[pgdrive] CRM_meta_notify_stop_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_start_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_inactive_resource=[ ]
CRM_meta_notify_master_uname=[srvplan1 ]
CRM_meta_notify_promote_uname=[srvplan1 ] CRM_meta_timeout=[20000]
CRM_meta_name=[start] CRM_meta_notify_demote_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_start_uname=[srvplan1 ] crm_feature_set=[3.0.6]
CRM_meta_clone_node_max=[: pid [20776] timed out
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[42] on ocf::drbd::drbd_pgdrive:0 for client
1128, its parameters: drbd_resource=[pgdrive] CRM_meta_role=[Master]
CRM_meta_notify_stop_resource=[ ]
CRM_meta_notify_start_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_inactive_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_master_uname=[srvplan2 ] CRM_meta_timeout=[60000]
CRM_meta_name=[monitor] CRM_meta_notify_demote_resource=[ ]
CRM_meta_notify_promote_uname=[ ] crm_feature_set=[3.0.6]
CRM_meta_notify=[true] CRM_meta_notify_start_uname=[srvplan1 stayed in
operation list for 40500 ms (longer than 10000 ms)
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: operation monitor[32] on
ocf::anything::PlanServer:0 for client 1128, its parameters:
logfile=[/home/tmo/PlanServer/PlanServer.log.txt]
CRM_meta_timeout=[20000] CRM_meta_name=[monitor]
binfile=[/home/tmo/PlanServer/PlanServer] crm_feature_set=[3.0.6]
CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0]
CRM_meta_clone_max=[2] pidfile=[/home/tmo/PlanServer/PlanServer.pid]
CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false] :
pid [20777] timed out
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: operation monitor[25] on
ocf::ping::ping:0 for client 1128, its parameters:
CRM_meta_timeout=[20000] multiplier=[1000] CRM_meta_name=[monitor]
crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] host_list=[xxx.xxx.xxx.2]
CRM_meta_clone_max=[2] CRM_meta_interval=[10000]
CRM_meta_globally_unique=[false] : pid [20778] timed out
Mar 23 09:04:11 srvplan1 lrmd: [1125]: info: rsc:tomcat:0:63: stop
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation stop[63] on ocf::tomcat::tomcat:0 for client 1128,
its parameters: java_home=[/usr/lib/jvm/jre] tomcat_user=[tomcat]
catalina_home=[/usr/share/tomcat] CRM_meta_timeout=[120000]
CRM_meta_name=[stop] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[0] CRM_meta_clone_max=[2]
script_log=[/home/tmo/log/tomcat.log]
statusurl=[http://127.0.0.1:8080/status/]
CRM_meta_globally_unique=[false] stayed in operation list for 48100 ms
(longer than 10000 ms)
Mar 23 09:04:11 srvplan1 lrmd: [1125]: info: rsc:apache:0:64: stop
Mar 23 09:04:11 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation stop[64] on ocf::apache::apache:0 for client 1128,
its parameters: testregex=[</html>] CRM_meta_timeout=[60000]
CRM_meta_name=[stop] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
configfile=[/etc/httpd/conf/httpd.conf] CRM_meta_clone_node_max=[1]
CRM_meta_clone=[0] CRM_meta_clone_max=[2]
CRM_meta_globally_unique=[false] stayed in operation list for 48100 ms
(longer than 10000 ms)
Mar 23 09:04:51 srvplan1 lrmd: [1125]: WARN: postgresql:stop process
(PID 20779) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:04:51 srvplan1 lrmd: [1125]: WARN: operation stop[61] on
ocf::pgsql::postgresql for client 1128, its parameters:
crm_feature_set=[3.0.6] CRM_meta_name=[stop] CRM_meta_timeout=[60000]
pgdata=[/hd2/pgsql] : pid [20779] timed out
Mar 23 09:04:51 srvplan1 lrmd: [1125]: WARN: perform_ra_op: the
operation operation monitor[32] on ocf::anything::PlanServer:0 for
client 1128, its parameters:
logfile=[/home/tmo/PlanServer/PlanServer.log.txt]
CRM_meta_timeout=[20000] CRM_meta_name=[monitor]
binfile=[/home/tmo/PlanServer/PlanServer] crm_feature_set=[3.0.6]
CRM_meta_notify=[false] CRM_meta_clone_node_max=[1] CRM_meta_clone=[0]
CRM_meta_clone_max=[2] pidfile=[/home/tmo/PlanServer/PlanServer.pid]
CRM_meta_interval=[10000] user=[tmo] CRM_meta_globally_unique=[false]
stayed in operation list for 30010 ms (longer than 10000 ms)
Mar 23 09:05:11 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:monitor
process (PID 20790) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:05:11 srvplan1 lrmd: [1125]: WARN: apache:0:stop process (PID
20792) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:05:11 srvplan1 lrmd: [1125]: WARN: PlanServer:0:monitor
process (PID 20797) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:05:16 srvplan1 lrmd: [1125]: WARN: drbd_pgdrive:0:monitor
process (PID 20790) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 09:05:16 srvplan1 lrmd: [1125]: WARN: apache:0:stop process (PID
20792) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 09:05:16 srvplan1 lrmd: [1125]: WARN: PlanServer:0:monitor
process (PID 20797) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 09:05:21 srvplan1 lrmd: [1125]: ERROR:
TrackedProcTimeoutFunction: drbd_pgdrive:0:monitor process (PID 20790)
will not die!
Mar 23 09:05:21 srvplan1 lrmd: [1125]: ERROR:
TrackedProcTimeoutFunction: apache:0:stop process (PID 20792) will not
die!
Mar 23 09:05:21 srvplan1 lrmd: [1125]: ERROR:
TrackedProcTimeoutFunction: PlanServer:0:monitor process (PID 20797)
will not die!
Mar 23 09:06:11 srvplan1 lrmd: [1125]: WARN: tomcat:0:stop process (PID
20791) timed out (try 1). Killing with signal SIGTERM (15).
Mar 23 09:06:16 srvplan1 lrmd: [1125]: WARN: tomcat:0:stop process (PID
20791) timed out (try 2). Killing with signal SIGKILL (9).
Mar 23 09:06:21 srvplan1 lrmd: [1125]: ERROR:
TrackedProcTimeoutFunction: tomcat:0:stop process (PID 20791) will not
die!
Mar 23 09:06:23 srvplan1 lrmd: [1125]: info: perform_op:2906: operation
monitor[42] on ocf::drbd::drbd_pgdrive:0 for client 1128, its
parameters: drbd_resource=[pgdrive] CRM_meta_role=[Master]
CRM_meta_notify_stop_resource=[ ]
CRM_meta_notify_start_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_inactive_resource=[drbd_pgdrive:0 ]
CRM_meta_notify_master_uname=[srvplan2 ] CRM_meta_timeout=[60000]
CRM_meta_name=[monitor] CRM_meta_notify_demote_resource=[ ]
CRM_meta_notify_promote_uname=[ ] crm_feature_set=[3.0.6]
CRM_meta_notify=[true] CRM_meta_notify_start_uname=[srvplan1 for rsc is
already running.
Mar 23 09:06:23 srvplan1 lrmd: [1125]: info: perform_op:2916: postponing
all ops on resource drbd_pgdrive:0 by 1000 ms
At this point the second node finally realizes something is wrong,
fences the first node and takes over. After the reboot, everything seems
to be working fine. Needless to say, 1 hour 45 minutes is a bit too long
for a recovery.
Got any ideas where to look? Basically, I'd like Pacemaker to detect
whatever happened and fail over to the other node quickly, instead of
spending all that time trying to monitor, restart or do whatever else it
was doing with those resources.
As far as I understand, Pacemaker is supposed to restart a service as
soon as its monitor operation fails (and since I didn't specify on-fail
for the monitor action, the default should apply). Why didn't it try to
restart any resources until 45 minutes later? I expected to see
something like this:
monitor fails -> restart fails -> STONITH
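In crm terms, I assumed the implicit defaults amount to roughly this
for, say, the PostgreSQL resource (illustrative only, with the timeouts
from our config; we don't set on-fail explicitly anywhere):

    primitive postgresql ocf:heartbeat:pgsql \
            params pgdata="/hd2/pgsql" \
            op monitor interval="30s" timeout="30s" on-fail="restart" \
            op stop interval="0" timeout="60s" on-fail="fence"

That is, a monitor that hangs should be treated as failed once its
30-second timeout expires, and a stop that hangs should escalate to
fencing, rather than everything dragging on in 45-minute stretches.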