[Pacemaker] Timeout value of STONITH resource is too large
David Vossel
dvossel at redhat.com
Tue Jul 31 15:01:06 UTC 2012
----- Original Message -----
> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
> To: "pacemaker at oss" <pacemaker at oss.clusterlabs.org>
> Cc: shimazakik at intellilink.co.jp
> Sent: Monday, July 30, 2012 5:13:40 AM
> Subject: [Pacemaker] Timeout value of STONITH resource is too large
>
> Hi,
>
> I am using Pacemaker-1.1.
> - glue (2012 Jul 16) 2719:18489f275f75
> - libqb (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
> - corosync (2012 Jul 12) 908ed7dcb390c0eade3dddb0cdfe181eb26b2ce2
> - pacemaker (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc
>
> The timeout value of the _start_ operation of STONITH resource is
> large.
> Therefore, even after the start operation is timed out, the process
> of
> plugin remains.
How long after the timeout occurs does the process stay around? Does it terminate a few seconds after the timeout, or does the resource wait until the entire duration of the sleep 3600 finishes?
-- Vossel
>
> The following is gdb at the time of STONITH resource starting.
> ----
> [root at dev1 ~]# gdb /usr/libexec/pacemaker/stonithd `pgrep
> stonithd`
> (gdb) b run_stonith_agent
> Breakpoint 1 at 0x7f03f1e00d69: file st_client.c, line 479.
> (gdb) c
> Continuing.
>
> Breakpoint 1, run_stonith_agent (agent=0xe0f820 "fence_legacy",
> action=0xe11fb0 "monitor",
> <snip>
> 479 {
> (gdb) bt
> #0 run_stonith_agent (agent=0xe0f820 "fence_legacy",
> action=0xe11fb0 "monitor",
> victim=0x0, device_args=Traceback (most recent call
> last):0xcffe30, port_map=
> Traceback (most recent call last):0xcffe80,
> agent_result=0x7fff70214ef4,
> output=0x0, track=0xe11d20) at st_client.c:479
> #1 0x0000000000406230 in stonith_device_execute (device=0xe10ff0)
> at commands.c:140
> #2 0x0000000000406404 in stonith_device_dispatch
> (user_data=0xe10ff0) at commands.c:160
> #3 0x00007f03f224ad00 in crm_trigger_dispatch (source=0xe11160,
> callback=
> 0x4063dd <stonith_device_dispatch>, userdata=0xe11160) at
> mainloop.c:105
> #4 0x0000003642638f0e in g_main_context_dispatch () from
> /lib64/libglib-2.0.so.0
> #5 0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
> #6 0x000000364263cd55 in g_main_loop_run () from
> /lib64/libglib-2.0.so.0
> #7 0x00000000004056dc in main (argc=1, argv=0x7fff70215278) at
> main.c:853
> (gdb) n 15
> Detaching after fork from child process 28915.
> 510 if (pid) {
> (gdb) n 15
> 542 track->pid = pid;
> (gdb) list
> 537 track->stdout = p_read_fd;
> 538 g_child_watch_add(pid, track->done, track);
> 539 crm_trace("Op: %s on %s, pid: %d, timeout: %ds",
> action, agent, pid, track->timeout);
> 540
> 541 if (track->timeout) {
> 542 track->pid = pid;
> 543 track->timer_sigterm =
> g_timeout_add(1000*track->timeout, st_child_term, track);
> 544 track->timer_sigkill =
> g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
> 545
> 546 } else {
> (gdb) n
> 543 track->timer_sigterm =
> g_timeout_add(1000*track->timeout, st_child_term, track);
> (gdb) n
> 544 track->timer_sigkill =
> g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
> (gdb) p agent
> $1 = 0xe0f820 "fence_legacy"
> (gdb) p action
> $2 = 0xe11fb0 "monitor"
> (gdb) p args
> $3 = 0xe11500
> "plugin=external/libvirt\nhostlist=dev2\nhypervisor_uri=qemu+ssh://n8/system\noption=monitor\n"
> * (gdb) p track->timeout
> $4 = 61000
> * (gdb) p 1000*track->timeout
> $5 = 61000000
> ----
> 1. I added "sleep 3600" to status() of
> /usr/lib64/stonith/plugins/external/libvirt.
>
> [root at dev1 external]# diff -u libvirt.ORG libvirt
> --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
> +++ libvirt 2012-07-30 13:36:19.661431208 +0900
> @@ -221,6 +221,7 @@
> ;;
>
> status)
> + sleep 3600
> libvirt_check_config
> libvirt_status
> exit $?
>
> 2. service corosync start ; service pacemaker start
> 3. cibadmin -U -x test.xml
> 4. When I wait for 61 seconds (timeout value of start),
>
> [root at dev1 ~]# crm_mon -rf1
> ============
> Last updated: Mon Jul 30 13:18:48 2012
> Last change: Mon Jul 30 13:15:08 2012 via cibadmin on dev1
> Stack: corosync
> Current DC: dev1 (-1788499776) - partition with quorum
> Version: 1.1.7-33119da
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ dev1 dev2 ]
>
> Full list of resources:
>
> f-2 (stonith:external/libvirt): Started dev1 FAILED
>
> Migration summary:
> * Node dev2:
> * Node dev1:
> f-2: migration-threshold=1 fail-count=1000000
>
> Failed actions:
> * f-2_start_0 (node=dev1, call=-1, rc=1, status=Timed Out):
> unknown error
>
> [root at dev1 ~]# ps -ef|egrep
> "UID|corosync|pacemaker|stonith|fence|sleep"
> UID PID PPID C STIME TTY TIME CMD
> root 28840 1 0 13:13 ? 00:00:01 corosync
> root 28858 1 0 13:13 ? 00:00:00 pacemakerd
> 496 28860 28858 0 13:13 ? 00:00:00 /usr/libexec/pacemaker/cib
> root 28861 28858 0 13:13 ? 00:00:00
> /usr/libexec/pacemaker/stonithd
> root 28862 28858 73 13:13 ? 00:04:16 /usr/libexec/pacemaker/lrmd
> 496 28863 28858 0 13:13 ? 00:00:00
> /usr/libexec/pacemaker/attrd
> 496 28864 28858 0 13:13 ? 00:00:00
> /usr/libexec/pacemaker/pengine
> 496 28865 28858 51 13:13 ? 00:02:58 /usr/libexec/pacemaker/crmd
> * root 28915 28861 0 13:15 ? 00:00:00 /usr/bin/perl
> /usr/sbin/fence_legacy
> * root 28916 28915 0 13:15 ? 00:00:00 stonith -t external/libvirt
> -E -S
> * root 28921 28916 0 13:15 ? 00:00:00 /bin/sh
> /usr/lib64/stonith/plugins/external/libvirt status
> root 28925 28921 0 13:15 ? 00:00:00 sleep 3600
>
> [root at dev1 ~]# top -bn1
> top - 13:21:26 up 5 days, 3:23, 5 users, load average: 1.99,
> 1.42, 0.72
> Tasks: 198 total, 3 running, 195 sleeping, 0 stopped, 0
> zombie
> Cpu(s): 0.7%us, 0.7%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi,
> 0.0%si, 0.0%st
> Mem: 5089052k total, 2423104k used, 2665948k free, 265756k
> buffers
> Swap: 1048568k total, 0k used, 1048568k free, 1717712k
> cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
> COMMAND
> * 28862 root 20 0 83816 3412 2572 R 98.2 0.1 6:17.18 lrmd
> * 28865 hacluste 20 0 166m 6380 3428 R 98.2 0.1 4:59.84 crmd
> 28860 hacluste 20 0 93888 7192 4472 S 2.0 0.1 0:00.23 cib
> 29052 root 20 0 15024 1136 792 R 2.0 0.0 0:00.01 top
> 1 root 20 0 19348 1520 1212 S 0.0 0.0 0:00.77 init
> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00
> kthreadd
> 3 root RT 0 0 0 0 S 0.0 0.0 0:06.85
> migration/0
> 4 root 20 0 0 0 0 S 0.0 0.0 14:25.15
> ksoftirqd/0
> 5 root RT 0 0 0 0 S 0.0 0.0 0:00.10
> migration/0
>
> Best Regards,
> Kazunori INOUE
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
More information about the Pacemaker
mailing list