[Pacemaker] Timeout value of STONITH resource is too large

Kazunori INOUE inouekazu at intellilink.co.jp
Sun Aug 5 22:56:47 EDT 2012


Hi David,

I have reported it in Bugzilla, with my configuration attached.
* http://bugs.clusterlabs.org/show_bug.cgi?id=5092

Best Regards,
Kazunori INOUE

(12.08.03 12:19), David Vossel wrote:
> ----- Original Message -----
>> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Thursday, August 2, 2012 4:37:45 AM
>> Subject: Re: [Pacemaker] Timeout value of STONITH resource is too large
>>
>> Hi David,
>>
>> The resource (the process forked from stonithd) stays around until
>> the sleep completes; stonithd never calls st_child_term().
>>
>> I set the start operation's timeout to 61 seconds, so the value of
>> track->timeout in run_stonith_agent() is 61000 (ms). However,
>> st_child_term() is not called until '1000*track->timeout'
>> milliseconds have elapsed; that is, approximately _17 hours_ later.
>
> Whoa, that isn't good at all!  Please create a bug report for this.  Include your config.
>
> -- Vossel
>
>> - lib/fencing/st_client.c
>>    476 run_stonith_agent(const char *agent, const char *action, const char *victim,
>>    477               GHashTable * device_args, GHashTable * port_map, int *agent_result, char **output,
>>    478               async_command_t * track)
>>
>>    541         if (track->timeout) {
>>    542             track->pid = pid;
>>    543             track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
>>    544             track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
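>>
>> If track->timeout really is already in milliseconds here (the gdb
>> session below shows 61000 for a 61-second timeout), a minimal sketch
>> of a fix is simply to drop the extra scaling. This is only an
>> illustration of the idea, not an actual patch:
>>
>>        if (track->timeout) {
>>            track->pid = pid;
>>            /* track->timeout is already in ms; g_timeout_add() expects ms,
>>             * so pass it through unscaled */
>>            track->timer_sigterm = g_timeout_add(track->timeout, st_child_term, track);
>>            /* escalate to SIGKILL 5 seconds after the SIGTERM timer */
>>            track->timer_sigkill = g_timeout_add(track->timeout + 5000, st_child_kill, track);
>>        }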
>>
>> Best Regards,
>> Kazunori INOUE
>>
>> (12.08.01 00:01), David Vossel wrote:
>>> ----- Original Message -----
>>>> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
>>>> To: "pacemaker at oss" <pacemaker at oss.clusterlabs.org>
>>>> Cc: shimazakik at intellilink.co.jp
>>>> Sent: Monday, July 30, 2012 5:13:40 AM
>>>> Subject: [Pacemaker] Timeout value of STONITH resource is too
>>>> large
>>>>
>>>> Hi,
>>>>
>>>> I am using Pacemaker-1.1.
>>>> - glue       (2012 Jul 16) 2719:18489f275f75
>>>> - libqb      (2012 Jul 19)
>>>> 11b20e19beff7f1b6003be0b4c73da8ecf936442
>>>> - corosync   (2012 Jul 12)
>>>> 908ed7dcb390c0eade3dddb0cdfe181eb26b2ce2
>>>> - pacemaker  (2012 Jul 29)
>>>> 33119da31c235710195c783e5c9a32c6e95b3efc
>>>>
>>>> The timeout value applied to the _start_ operation of a STONITH
>>>> resource is far too large. As a result, even after the start
>>>> operation times out, the plugin process remains running.
>>>
>>> How long after the timeout occurs does the process stay around?
>>> Does it terminate a few seconds after the timeout, or does it wait
>>> until the entire 'sleep 3600' finishes?
>>>
>>> -- Vossel
>>>
>>>
>>>>
>>>> The following is a gdb session captured while the STONITH resource
>>>> was starting.
>>>> ----
>>>>      [root@dev1 ~]# gdb /usr/libexec/pacemaker/stonithd `pgrep stonithd`
>>>>      (gdb) b run_stonith_agent
>>>>      Breakpoint 1 at 0x7f03f1e00d69: file st_client.c, line 479.
>>>>      (gdb) c
>>>>      Continuing.
>>>>
>>>>      Breakpoint 1, run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor",
>>>>       <snip>
>>>>      479     {
>>>>      (gdb) bt
>>>>      #0  run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor",
>>>>          victim=0x0, device_args=Traceback (most recent call last):0xcffe30,
>>>>          port_map=Traceback (most recent call last):0xcffe80, agent_result=0x7fff70214ef4,
>>>>          output=0x0, track=0xe11d20) at st_client.c:479
>>>>      #1  0x0000000000406230 in stonith_device_execute (device=0xe10ff0) at commands.c:140
>>>>      #2  0x0000000000406404 in stonith_device_dispatch (user_data=0xe10ff0) at commands.c:160
>>>>      #3  0x00007f03f224ad00 in crm_trigger_dispatch (source=0xe11160,
>>>>          callback=0x4063dd <stonith_device_dispatch>, userdata=0xe11160) at mainloop.c:105
>>>>      #4  0x0000003642638f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>>>>      #5  0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
>>>>      #6  0x000000364263cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
>>>>      #7  0x00000000004056dc in main (argc=1, argv=0x7fff70215278) at main.c:853
>>>>      (gdb) n 15
>>>>      Detaching after fork from child process 28915.
>>>>      510       if (pid) {
>>>>      (gdb) n 15
>>>>      542                 track->pid = pid;
>>>>      (gdb) list
>>>>      537             track->stdout = p_read_fd;
>>>>      538             g_child_watch_add(pid, track->done, track);
>>>>      539             crm_trace("Op: %s on %s, pid: %d, timeout: %ds", action, agent, pid, track->timeout);
>>>>      540
>>>>      541             if (track->timeout) {
>>>>      542                 track->pid = pid;
>>>>      543                 track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
>>>>      544                 track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
>>>>      545
>>>>      546             } else {
>>>>      (gdb) n
>>>>      543                 track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
>>>>      (gdb) n
>>>>      544                 track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
>>>>      (gdb) p agent
>>>>      $1 = 0xe0f820 "fence_legacy"
>>>>      (gdb) p action
>>>>      $2 = 0xe11fb0 "monitor"
>>>>      (gdb) p args
>>>>      $3 = 0xe11500 "plugin=external/libvirt\nhostlist=dev2\nhypervisor_uri=qemu+ssh://n8/system\noption=monitor\n"
>>>>    * (gdb) p track->timeout
>>>>      $4 = 61000
>>>>    * (gdb) p 1000*track->timeout
>>>>      $5 = 61000000
>>>> ----
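>>>>
>>>> That last value is the interval actually handed to g_timeout_add(),
>>>> which takes milliseconds:
>>>>
>>>>      1000 * 61000 ms = 61,000,000 ms = 61,000 s ~= 16.9 hours
>>>>
>>>> so the SIGTERM timer would not fire for roughly 17 hours.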
>>>> 1. I added "sleep 3600" to status() of
>>>>      /usr/lib64/stonith/plugins/external/libvirt.
>>>>
>>>>      [root@dev1 external]# diff -u libvirt.ORG libvirt
>>>>      --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
>>>>      +++ libvirt     2012-07-30 13:36:19.661431208 +0900
>>>>      @@ -221,6 +221,7 @@
>>>>           ;;
>>>>
>>>>           status)
>>>>      +    sleep 3600
>>>>           libvirt_check_config
>>>>           libvirt_status
>>>>           exit $?
>>>>
>>>> 2. service corosync start ; service pacemaker start
>>>> 3. cibadmin -U -x test.xml
>>>> 4. After waiting 61 seconds (the timeout value of start):
>>>>
>>>>      [root@dev1 ~]# crm_mon -rf1
>>>>      ============
>>>>      Last updated: Mon Jul 30 13:18:48 2012
>>>>      Last change: Mon Jul 30 13:15:08 2012 via cibadmin on dev1
>>>>      Stack: corosync
>>>>      Current DC: dev1 (-1788499776) - partition with quorum
>>>>      Version: 1.1.7-33119da
>>>>      2 Nodes configured, unknown expected votes
>>>>      1 Resources configured.
>>>>      ============
>>>>
>>>>      Online: [ dev1 dev2 ]
>>>>
>>>>      Full list of resources:
>>>>
>>>>       f-2    (stonith:external/libvirt):     Started dev1 FAILED
>>>>
>>>>      Migration summary:
>>>>      * Node dev2:
>>>>      * Node dev1:
>>>>         f-2: migration-threshold=1 fail-count=1000000
>>>>
>>>>      Failed actions:
>>>>    *     f-2_start_0 (node=dev1, call=-1, rc=1, status=Timed Out): unknown error
>>>>
>>>>      [root@dev1 ~]# ps -ef|egrep "UID|corosync|pacemaker|stonith|fence|sleep"
>>>>      UID    PID  PPID  C STIME TTY     TIME CMD
>>>>      root 28840     1  0 13:13 ?   00:00:01 corosync
>>>>      root 28858     1  0 13:13 ?   00:00:00 pacemakerd
>>>>      496  28860 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/cib
>>>>      root 28861 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/stonithd
>>>>      root 28862 28858 73 13:13 ?   00:04:16 /usr/libexec/pacemaker/lrmd
>>>>      496  28863 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/attrd
>>>>      496  28864 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/pengine
>>>>      496  28865 28858 51 13:13 ?   00:02:58 /usr/libexec/pacemaker/crmd
>>>>    * root 28915 28861  0 13:15 ?   00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
>>>>    * root 28916 28915  0 13:15 ?   00:00:00 stonith -t external/libvirt -E -S
>>>>    * root 28921 28916  0 13:15 ?   00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt status
>>>>      root 28925 28921  0 13:15 ?   00:00:00 sleep 3600
>>>>
>>>>      [root@dev1 ~]# top -bn1
>>>>      top - 13:21:26 up 5 days,  3:23,  5 users,  load average: 1.99, 1.42, 0.72
>>>>      Tasks: 198 total,   3 running, 195 sleeping,   0 stopped,   0 zombie
>>>>      Cpu(s):  0.7%us,  0.7%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>      Mem:   5089052k total,  2423104k used,  2665948k free,   265756k buffers
>>>>      Swap:  1048568k total,        0k used,  1048568k free,  1717712k cached
>>>>
>>>>        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>    * 28862 root      20   0 83816 3412 2572 R 98.2  0.1   6:17.18 lrmd
>>>>    * 28865 hacluste  20   0  166m 6380 3428 R 98.2  0.1   4:59.84 crmd
>>>>      28860 hacluste  20   0 93888 7192 4472 S  2.0  0.1   0:00.23 cib
>>>>      29052 root      20   0 15024 1136  792 R  2.0  0.0   0:00.01 top
>>>>          1 root      20   0 19348 1520 1212 S  0.0  0.0   0:00.77 init
>>>>          2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
>>>>          3 root      RT   0     0    0    0 S  0.0  0.0   0:06.85 migration/0
>>>>          4 root      20   0     0    0    0 S  0.0  0.0  14:25.15 ksoftirqd/0
>>>>          5 root      RT   0     0    0    0 S  0.0  0.0   0:00.10 migration/0
>>>>
>>>> Best Regards,
>>>> Kazunori INOUE
>>>>



