[Pacemaker] hangs pending

Tue Jan 14 13:15:15 UTC 2014

14.01.2014, 10:00, "Andrey Groshev" <greenx at yandex.ru>:
> 14.01.2014, 07:47, "Andrew Beekhof" <andrew at beekhof.net>:
>
>>  Ok, here's what happens:
>>
>>  1. node2 is lost
>>  2. fencing of node2 starts
>>  3. node2 reboots (and cluster starts)
>>  4. node2 returns to the membership
>>  5. node2 is marked as a cluster member
>>  6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>     Which is a problem since the node2 fencing operation is part of that
>>  7. node2 is in a transition (pending) state until fencing passes or fails
>>  8a. fencing fails: transition completes and the node joins the cluster
>>
>>  That's in theory, except we automatically try again, which isn't appropriate.
>>  This should be relatively easy to fix.
>>
>>  8b. fencing passes: the node is incorrectly marked as offline
>>
>>  This I have no idea how to fix yet.
>>
>>  On another note, it doesn't look like this agent works at all.
>>  The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>  So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>
> Damn. Looks like you're right. At some point I broke my agent and did not notice it. Who knows how.

I repaired my agent: after sending the reboot, it was waiting on STDIN.
The "normal" behavior is back: the node hangs in "pending" until I manually send a reboot. :)
New logs: http://send2me.ru/crmrep1.tar.bz2
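
For reference, a minimal sketch of the "send reboot, then wait until the victim answers again" loop, written so that the probe can never block on STDIN. The function name wait_until_up and its arguments are hypothetical; this is not the actual sshbykey code.

```shell
#!/bin/sh
# Hypothetical helper, not part of sshbykey: poll a probe command until it
# succeeds or the deadline passes.
# Usage: wait_until_up TIMEOUT_SECONDS INTERVAL_SECONDS PROBE_COMMAND [ARGS...]
wait_until_up() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # stdin comes from /dev/null so the probe cannot sit waiting for input
        if "$@" < /dev/null > /dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}
```

A probe could be something like "ssh -n -o BatchMode=yes -o ConnectTimeout=5 root@victim true" (the host name is a placeholder). The timeout passed in would presumably need to stay below stonith-timeout, and a real agent would also want to see the victim go down before it starts waiting for it to come back.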

>
>>  On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>   Apart from anything else, your timeout needs to be bigger:
>>>
>>>   Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>
>>>   On 14 Jan 2014, at 7:18 am, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>   On 13 Jan 2014, at 8:31 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>   13.01.2014, 02:51, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>   On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>   10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>   10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>   On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>    10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>     On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>      08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>      On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>       Hi, ALL.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       I'm still trying to cope with the fact that after fencing, the node hangs in "pending".
>>>>>>>>>>>>>      Please define "pending".  Where did you see this?
>>>>>>>>>>>>      In crm_mon:
>>>>>>>>>>>>      ......
>>>>>>>>>>>>      Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>      ......
>>>>>>>>>>>>
>>>>>>>>>>>>      The experiment was like this:
>>>>>>>>>>>>      Four nodes in cluster.
>>>>>>>>>>>>      On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>>>>>>>>>      After that, the remaining nodes keep rebooting it under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>>>>      Then "Too many failures ...." showed up in the log.
>>>>>>>>>>>>      All this time the status in crm_mon is "pending".
>>>>>>>>>>>>      Depending on which way the wind blows, it changes to "UNCLEAN".
>>>>>>>>>>>>      A lot of time has passed and I cannot describe the behavior accurately...
>>>>>>>>>>>>
>>>>>>>>>>>>      Now I am in the following state:
>>>>>>>>>>>>      I tried to locate the problem and came up with this.
>>>>>>>>>>>>      I set a large value for the property stonith-timeout="600s".
>>>>>>>>>>>>      And got the following behavior:
>>>>>>>>>>>>      1. pkill -4 corosync
>>>>>>>>>>>>      2. The DC node calls my fence agent "sshbykey".
>>>>>>>>>>>>      3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>>>>>>>     Hmmm.... what version of pacemaker?
>>>>>>>>>>>     This sounds like a timing issue that we fixed a while back
>>>>>>>>>>    It was version 1.1.11 from December 3.
>>>>>>>>>>    I will now do a full update and retest.
>>>>>>>>>   That should be recent enough.  Can you create a crm_report the next time you reproduce?
>>>>>>>>   Yes, of course. There will be a small delay... :)
>>>>>>>>
>>>>>>>>   ......
>>>>>>>>   cc1: warnings being treated as errors
>>>>>>>>   upstart.c: In function ‘upstart_job_property’:
>>>>>>>>   upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>   upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>   upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>   gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>   gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>   make[1]: *** [all-recursive] Error 1
>>>>>>>>   make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>   make: *** [core] Error 1
>>>>>>>>
>>>>>>>>   I'm trying to solve this problem.
>>>>>>>   It is not getting solved quickly...
>>>>>>>
>>>>>>>   https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>   g_variant_lookup_value () Since 2.28
>>>>>>>
>>>>>>>   # yum list installed glib2
>>>>>>>   Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>   This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>   Loading mirror speeds from cached hostfile
>>>>>>>   Installed Packages
>>>>>>>   glib2.x86_64                                                              2.26.1-3.el6                                                               installed
>>>>>>>
>>>>>>>   # cat /etc/issue
>>>>>>>   CentOS release 6.5 (Final)
>>>>>>>   Kernel \r on an \m
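
The mismatch above (g_variant_lookup_value needs glib 2.28, the host has 2.26.1) can be checked up front. A minimal sketch, assuming GNU sort with the -V option is available; the helper name version_at_least is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is at least version $2.
# Uses `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    have=$1
    need=$2
    # The smaller of the two versions sorts first; if that is the required
    # version, the installed one is equal or newer.
    lowest=$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)
    [ "$lowest" = "$need" ]
}

# 2.26.1 is the upstream part of the installed glib2 package version above.
if version_at_least 2.26.1 2.28; then
    echo "glib is new enough for g_variant_lookup_value"
else
    echo "glib is too old for g_variant_lookup_value"
fi
```

Where the glib development files are installed, "pkg-config --atleast-version=2.28 glib-2.0" should answer the same question through its exit status.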
>>>>>>   Can you try this patch?
>>>>>>   Upstart jobs won't work, but the code will compile
>>>>>>
>>>>>>   diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>   index 831e7cf..195c3a4 100644
>>>>>>   --- a/lib/services/upstart.c
>>>>>>   +++ b/lib/services/upstart.c
>>>>>>   @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>   static char *
>>>>>>   upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>   {
>>>>>>   +    char *output = NULL;
>>>>>>   +
>>>>>>   +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>   +    static bool err = TRUE;
>>>>>>   +
>>>>>>   +    if(err) {
>>>>>>   +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>   +        err = FALSE;
>>>>>>   +    }
>>>>>>   +#else
>>>>>>      GError *error = NULL;
>>>>>>      GDBusProxy *proxy;
>>>>>>      GVariant *asv = NULL;
>>>>>>      GVariant *value = NULL;
>>>>>>      GVariant *_ret = NULL;
>>>>>>   -    char *output = NULL;
>>>>>>
>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>   @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>
>>>>>>      g_object_unref(proxy);
>>>>>>      g_variant_unref(_ret);
>>>>>>   +#endif
>>>>>>      return output;
>>>>>>   }
>>>>>   Ok :) I patched the source.
>>>>>   Typed "make rc" - the same error.
>>>>   Because it's not building your local changes
>>>>>   I made a new copy via "fetch" - the same error.
>>>>>   It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it is downloaded.
>>>>>   Otherwise the existing archive is used.
>>>>>   Trimmed log .......
>>>>>
>>>>>   # make rc
>>>>>   make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>   make[1]: Entering directory `/root/ha/pacemaker'
>>>>>   rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>   if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then                                             \
>>>>>            rm -f pacemaker.tar.*;                                              \
>>>>>            if [ Pacemaker-1.1.11-rc3 = dirty ]; then                                   \
>>>>>                git commit -m "DO-NOT-PUSH" -a;                                 \
>>>>>                git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;       \
>>>>>                git reset --mixed HEAD^;                                        \
>>>>>            else                                                                \
>>>>>                git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;     \
>>>>>            fi;                                                                 \
>>>>>            echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                                     \
>>>>>        else                                                                    \
>>>>>            echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                     \
>>>>>        fi
>>>>>   Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>   .......
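
The rule printed in the log above can be restated as a small decision function. This is only a sketch of what the quoted shell fragment appears to do; the function name tarball_action and the labels it prints are hypothetical.

```shell
#!/bin/sh
# Hypothetical restatement of the tarball rule visible in the "make rc" output.
# Prints which branch would run for a given TAG value and tarball path.
tarball_action() {
    tag=$1
    tarball=$2
    if [ -f "$tarball" ]; then
        echo "reuse-existing"   # existing tarball is used as-is
    elif [ "$tag" = dirty ]; then
        echo "archive-head"     # temporary commit of local changes, then archive of HEAD
    else
        echo "archive-tag"      # archive of the named tag
    fi
}
```

Going by that rule, the only branch that picks up uncommitted edits is the one where TAG equals "dirty", and only when no tarball with the expected name is already present.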
>>>>>
>>>>>   Well, "make rpm" built the rpms and I created the cluster.
>>>>>   I ran the same tests and confirmed the behavior.
>>>>>   crm_report log is here - http://send2me.ru/crmrep.tar.bz2
>>>>   Thanks!
>>  _______________________________________________
>>  Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org