[Pacemaker] Timeout value of STONITH resource is too large
David Vossel
dvossel at redhat.com
Fri Aug 3 03:19:49 UTC 2012
----- Original Message -----
> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Thursday, August 2, 2012 4:37:45 AM
> Subject: Re: [Pacemaker] Timeout value of STONITH resource is too large
>
> Hi David,
>
> The resource (the process forked from stonithd) stays around until the
> sleep completes; stonithd does not call st_child_term().
>
> I set the start operation's timeout to 61 seconds, so the value of
> track->timeout in run_stonith_agent() is 61000 (ms). However,
> st_child_term() is not called until '1000*track->timeout' milliseconds
> have elapsed. That is, it fires approximately _17 hours_ later.
Whoa, that isn't good at all! Please create a bug report for this. Include your config.
-- Vossel
> - lib/fencing/st_client.c
> 476 run_stonith_agent(const char *agent, const char *action, const char *victim,
> 477                   GHashTable * device_args, GHashTable * port_map, int *agent_result, char **output,
> 478                   async_command_t * track)
>
> 541     if (track->timeout) {
> 542         track->pid = pid;
> 543         track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
> 544         track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
>
> Best Regards,
> Kazunori INOUE
>
> (12.08.01 00:01), David Vossel wrote:
> > ----- Original Message -----
> >> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
> >> To: "pacemaker at oss" <pacemaker at oss.clusterlabs.org>
> >> Cc: shimazakik at intellilink.co.jp
> >> Sent: Monday, July 30, 2012 5:13:40 AM
> >> Subject: [Pacemaker] Timeout value of STONITH resource is too
> >> large
> >>
> >> Hi,
> >>
> >> I am using Pacemaker-1.1.
> >> - glue (2012 Jul 16) 2719:18489f275f75
> >> - libqb (2012 Jul 19)
> >> 11b20e19beff7f1b6003be0b4c73da8ecf936442
> >> - corosync (2012 Jul 12)
> >> 908ed7dcb390c0eade3dddb0cdfe181eb26b2ce2
> >> - pacemaker (2012 Jul 29)
> >> 33119da31c235710195c783e5c9a32c6e95b3efc
> >>
> >> The effective timeout value of the _start_ operation of a STONITH
> >> resource is far too large. As a result, even after the start
> >> operation has timed out, the plugin process remains.
> >
> > How long after the timeout occurs does the process stay around?
> > Does it terminate a few seconds after the timeout, or does the
> > resource wait until the entire duration of the sleep 3600
> > finishes?
> >
> > -- Vossel
> >
> >
> >>
> >> The following is a gdb session captured while the STONITH resource was starting.
> >> ----
> >> [root at dev1 ~]# gdb /usr/libexec/pacemaker/stonithd `pgrep
> >> stonithd`
> >> (gdb) b run_stonith_agent
> >> Breakpoint 1 at 0x7f03f1e00d69: file st_client.c, line 479.
> >> (gdb) c
> >> Continuing.
> >>
> >> Breakpoint 1, run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor",
> >> <snip>
> >> 479     {
> >> (gdb) bt
> >> #0  run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor",
> >>     victim=0x0, device_args=0xcffe30, port_map=0xcffe80, agent_result=0x7fff70214ef4,
> >>     output=0x0, track=0xe11d20) at st_client.c:479
> >> #1  0x0000000000406230 in stonith_device_execute (device=0xe10ff0) at commands.c:140
> >> #2  0x0000000000406404 in stonith_device_dispatch (user_data=0xe10ff0) at commands.c:160
> >> #3  0x00007f03f224ad00 in crm_trigger_dispatch (source=0xe11160,
> >>     callback=0x4063dd <stonith_device_dispatch>, userdata=0xe11160) at mainloop.c:105
> >> #4  0x0000003642638f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> >> #5  0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
> >> #6  0x000000364263cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> >> #7  0x00000000004056dc in main (argc=1, argv=0x7fff70215278) at main.c:853
> >> (gdb) n 15
> >> Detaching after fork from child process 28915.
> >> 510     if (pid) {
> >> (gdb) n 15
> >> 542         track->pid = pid;
> >> (gdb) list
> >> 537         track->stdout = p_read_fd;
> >> 538         g_child_watch_add(pid, track->done, track);
> >> 539         crm_trace("Op: %s on %s, pid: %d, timeout: %ds", action, agent, pid, track->timeout);
> >> 540
> >> 541         if (track->timeout) {
> >> 542             track->pid = pid;
> >> 543             track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
> >> 544             track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
> >> 545
> >> 546         } else {
> >> (gdb) n
> >> 543             track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
> >> (gdb) n
> >> 544             track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
> >> (gdb) p agent
> >> $1 = 0xe0f820 "fence_legacy"
> >> (gdb) p action
> >> $2 = 0xe11fb0 "monitor"
> >> (gdb) p args
> >> $3 = 0xe11500
> >> "plugin=external/libvirt\nhostlist=dev2\nhypervisor_uri=qemu+ssh://n8/system\noption=monitor\n"
> >> * (gdb) p track->timeout
> >> $4 = 61000
> >> * (gdb) p 1000*track->timeout
> >> $5 = 61000000
> >> ----
> >> 1. I added "sleep 3600" to status() of
> >> /usr/lib64/stonith/plugins/external/libvirt.
> >>
> >> [root at dev1 external]# diff -u libvirt.ORG libvirt
> >> --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
> >> +++ libvirt 2012-07-30 13:36:19.661431208 +0900
> >> @@ -221,6 +221,7 @@
> >> ;;
> >>
> >> status)
> >> + sleep 3600
> >> libvirt_check_config
> >> libvirt_status
> >> exit $?
> >>
> >> 2. service corosync start ; service pacemaker start
> >> 3. cibadmin -U -x test.xml
> >> 4. After waiting 61 seconds (the start operation's timeout value):
> >>
> >> [root at dev1 ~]# crm_mon -rf1
> >> ============
> >> Last updated: Mon Jul 30 13:18:48 2012
> >> Last change: Mon Jul 30 13:15:08 2012 via cibadmin on dev1
> >> Stack: corosync
> >> Current DC: dev1 (-1788499776) - partition with quorum
> >> Version: 1.1.7-33119da
> >> 2 Nodes configured, unknown expected votes
> >> 1 Resources configured.
> >> ============
> >>
> >> Online: [ dev1 dev2 ]
> >>
> >> Full list of resources:
> >>
> >> f-2 (stonith:external/libvirt): Started dev1 FAILED
> >>
> >> Migration summary:
> >> * Node dev2:
> >> * Node dev1:
> >> f-2: migration-threshold=1 fail-count=1000000
> >>
> >> Failed actions:
> >>     f-2_start_0 (node=dev1, call=-1, rc=1, status=Timed Out): unknown error
> >>
> >> [root at dev1 ~]# ps -ef|egrep "UID|corosync|pacemaker|stonith|fence|sleep"
> >> UID        PID  PPID  C STIME TTY      TIME     CMD
> >> root     28840     1  0 13:13 ?        00:00:01 corosync
> >> root     28858     1  0 13:13 ?        00:00:00 pacemakerd
> >> 496      28860 28858  0 13:13 ?        00:00:00 /usr/libexec/pacemaker/cib
> >> root     28861 28858  0 13:13 ?        00:00:00 /usr/libexec/pacemaker/stonithd
> >> root     28862 28858 73 13:13 ?        00:04:16 /usr/libexec/pacemaker/lrmd
> >> 496      28863 28858  0 13:13 ?        00:00:00 /usr/libexec/pacemaker/attrd
> >> 496      28864 28858  0 13:13 ?        00:00:00 /usr/libexec/pacemaker/pengine
> >> 496      28865 28858 51 13:13 ?        00:02:58 /usr/libexec/pacemaker/crmd
> >> * root   28915 28861  0 13:15 ?        00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
> >> * root   28916 28915  0 13:15 ?        00:00:00 stonith -t external/libvirt -E -S
> >> * root   28921 28916  0 13:15 ?        00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt status
> >> root     28925 28921  0 13:15 ?        00:00:00 sleep 3600
> >>
> >> [root at dev1 ~]# top -bn1
> >> top - 13:21:26 up 5 days, 3:23, 5 users, load average: 1.99, 1.42, 0.72
> >> Tasks: 198 total, 3 running, 195 sleeping, 0 stopped, 0 zombie
> >> Cpu(s): 0.7%us, 0.7%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> >> Mem:  5089052k total, 2423104k used, 2665948k free, 265756k buffers
> >> Swap: 1048568k total, 0k used, 1048568k free, 1717712k cached
> >>
> >>     PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
> >> * 28862 root     20  0 83816 3412 2572 R 98.2  0.1  6:17.18 lrmd
> >> * 28865 hacluste 20  0  166m 6380 3428 R 98.2  0.1  4:59.84 crmd
> >>   28860 hacluste 20  0 93888 7192 4472 S  2.0  0.1  0:00.23 cib
> >>   29052 root     20  0 15024 1136  792 R  2.0  0.0  0:00.01 top
> >>       1 root     20  0 19348 1520 1212 S  0.0  0.0  0:00.77 init
> >>       2 root     20  0     0    0    0 S  0.0  0.0  0:00.00 kthreadd
> >>       3 root     RT  0     0    0    0 S  0.0  0.0  0:06.85 migration/0
> >>       4 root     20  0     0    0    0 S  0.0  0.0 14:25.15 ksoftirqd/0
> >>       5 root     RT  0     0    0    0 S  0.0  0.0  0:00.10 migration/0
> >>
> >> Best Regards,
> >> Kazunori INOUE
> >>
> >> _______________________________________________
> >> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started:
> >> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >>
> >
>
>