[Pacemaker] hangs pending

Thu Feb 20 18:55:12 EST 2014

On 20 Feb 2014, at 10:04 pm, Andrey Groshev <greenx at yandex.ru> wrote:

> 
> 
> 20.02.2014, 13:57, "Andrew Beekhof" <andrew at beekhof.net>:
>> On 20 Feb 2014, at 5:33 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>> 
>>>  20.02.2014, 01:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>  On 20 Feb 2014, at 4:18 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>   19.02.2014, 06:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>   On 18 Feb 2014, at 9:29 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>    Hi, ALL and Andrew!
>>>>>>> 
>>>>>>>    Today was a good day - I killed a lot, and got shot at a lot in return.
>>>>>>>    In general - I am happy (almost like an elephant)   :)
>>>>>>>    Besides the resources, eight processes on the node are important to me: corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>>>    I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>    The behavior does not depend on the signal number - that's good.
>>>>>>>    If STONITH sends a reboot to the node, it reboots and rejoins the cluster - that's good too.
>>>>>>>    But the behavior differs depending on which daemon is killed.
>>>>>>> 
>>>>>>>    It turned out there are four groups:
>>>>>>>    1. corosync, cib - STONITH works 100% of the time.
>>>>>>>    Killing with any signal triggers STONITH and a reboot.
>>>>>>   excellent
>>>>>>>    3. stonithd, attrd, pengine - no STONITH needed.
>>>>>>>    These daemons simply restart; the resources stay running.
>>>>>>   right
>>>>>>>    2. lrmd, crmd - strange STONITH behavior.
>>>>>>>    Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>    Sometimes the daemon restarts
>>>>>>   The daemon will always try to restart, the only variable is how long it takes the peer to notice and initiate fencing.
>>>>>>   If the failure happens just before they're due to receive the totem token, the failure will be very quickly detected and the node fenced.
>>>>>>   If the failure happens just after, then detection will take longer - giving the node longer to recover and not be fenced.
>>>>>> 
>>>>>>   So fence/not fence is normal and to be expected.
>>>>>>>    and the MS:pgsql resource restarts with a large delay.
>>>>>>>    Once, after crmd restarted, pgsql did not restart.
>>>>>>   I would not expect pgsql to ever restart - if the RA does its job properly anyway.
>>>>>>   In the case the node is not fenced, the crmd will respawn and the PE will request that it re-detect the state of all resources.
>>>>>> 
>>>>>>   If the agent reports "all good", then there is nothing more to do.
>>>>>>   If the agent is not reporting "all good", you should really be asking why.
>>>>>>>    4. pacemakerd - nothing happens.
>>>>>>   On non-systemd based machines, correct.
>>>>>> 
>>>>>>   On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons.
>>>>>>   Any subsequent daemon failure will be detected and the daemon respawned.
>>>>>   And! I almost forgot about IT!
>>>>>   Are there any other (NORMAL) options, methods, ideas?
>>>>>   Without this  ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>>>>   Otherwise - it's a full epic fail ;)
>>>>  -ENOPARSE
>>>  OK, I'll set aside my personal attitude to "systemd".
>>>  Let me explain.
>>> 
>>>  Somewhere in the beginning of this topic, I wrote:
>>>  A.G.:Who knows who runs lrmd?
>>>  A.B.:Pacemakerd.
>>>  That's one!
>>> 
>>>  Let's see the list of processes:
>>>  #ps -axf
>>>  .....
>>>  6067 ?        Ssl    7:24 corosync
>>>  6092 ?        S      0:25 pacemakerd
>>>  6094 ?        Ss   116:13  \_ /usr/libexec/pacemaker/cib
>>>  6095 ?        Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
>>>  6096 ?        Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
>>>  6097 ?        Ss     0:49  \_ /usr/libexec/pacemaker/attrd
>>>  6098 ?        Ss     0:25  \_ /usr/libexec/pacemaker/pengine
>>>  6099 ?        Ss     0:29  \_ /usr/libexec/pacemaker/crmd
>>>  .....
>>>  That's two!
>> 
>> What's two?  I don't follow.
> In the sense that it creates other processes. But it does not matter.
> 
> 
>>>  And more, more...
>>>  Now you must understand why I want this process to always be running.
>>>  I don't think anyone here even needs that explained!
>>> 
>>>  And now you tell me "pacemakerd works nicely, but only on systemd distros" !!!
>> 
>> No, I'm saying it works _better_ on systemd distros.
>> On non-systemd distros you still need quite a few unlikely-to-happen failures to trigger a situation in which the node still gets fenced and recovered (assuming no one saw any of the error messages and ran "service pacemaker restart" prior to the additional failures).
>> 
> Can you show me the place where:
> "On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons."?

The code for it is in mcp/pacemaker.c; look for find_and_track_existing_processes().

The ps tree will look different though:

 6094 ?        Ss   116:13  /usr/libexec/pacemaker/cib
 6095 ?        Ss     0:25  /usr/libexec/pacemaker/stonithd
 6096 ?        Ss     1:27  /usr/libexec/pacemaker/lrmd
 6097 ?        Ss     0:49  /usr/libexec/pacemaker/attrd
 6098 ?        Ss     0:25  /usr/libexec/pacemaker/pengine
 6099 ?        Ss     0:29  /usr/libexec/pacemaker/crmd
...
 6666 ?        S      0:25 pacemakerd

But pacemakerd will be watching the old children and respawning them on failure,
at which point you might see:

 6094 ?        Ss   116:13  /usr/libexec/pacemaker/cib
 6096 ?        Ss     1:27  /usr/libexec/pacemaker/lrmd
 6097 ?        Ss     0:49  /usr/libexec/pacemaker/attrd
 6098 ?        Ss     0:25  /usr/libexec/pacemaker/pengine
 6099 ?        Ss     0:29  /usr/libexec/pacemaker/crmd
...
 6666 ?        S      0:25 pacemakerd
 6667 ?        Ss     0:25 \_ /usr/libexec/pacemaker/stonithd
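
As a rough illustration of what reattaching involves (this is not the code from mcp/pacemaker.c, just a minimal Linux-only sketch), a respawned parent could look the old daemons up by name under /proc and then poll them with signal 0, since waitpid() only works for a process's own children. The helper names find_pid_by_name() and pid_is_alive() are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the PID of the first process whose
 * /proc/<pid>/comm equals name, or -1 if none is found.
 * Linux only; the kernel truncates comm to 15 characters. */
static pid_t find_pid_by_name(const char *name)
{
    DIR *proc;
    struct dirent *entry;
    pid_t found = -1;

    assert(name != NULL);
    proc = opendir("/proc");
    if (proc == NULL) {
        return -1;
    }
    while (found < 0 && (entry = readdir(proc)) != NULL) {
        char path[300];
        char comm[64];
        FILE *fp;

        if (!isdigit((unsigned char) entry->d_name[0])) {
            continue;           /* not a PID directory */
        }
        snprintf(path, sizeof(path), "/proc/%s/comm", entry->d_name);
        fp = fopen(path, "r");
        if (fp == NULL) {
            continue;           /* process exited or is not readable */
        }
        if (fgets(comm, sizeof(comm), fp) != NULL) {
            comm[strcspn(comm, "\n")] = '\0';
            if (strcmp(comm, name) == 0) {
                found = (pid_t) atol(entry->d_name);
            }
        }
        fclose(fp);
    }
    closedir(proc);
    return found;
}

/* Hypothetical helper: signal 0 only performs the existence and
 * permission checks; EPERM still means the process exists. */
static int pid_is_alive(pid_t pid)
{
    if (pid <= 0) {
        return 0;
    }
    return kill(pid, 0) == 0 || errno == EPERM;
}
```

For example, a new parent could call find_pid_by_name("cib"), remember the PID it gets back, and periodically call pid_is_alive() on it, respawning the daemon once that returns 0.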

> If I respawn the pacemakerd process via upstart, will it also reattach to the existing daemons?

If upstart is capable of detecting the pacemakerd failure and automagically respawning it, then yes - the same process will happen.

> 
>>>  What should I do now?
>>>  * Integrate systemd in CentOS?
>>>  * Migrate to Fedora?
>>>  * Buy RHEL7 !?
>> 
>> Option 3 is particularly good :)
> 
> It's too easy. Real heroes always take the long way around :) 
> 
>>>  Each of these options is great, but none of them fits me.
>>> 
>>>  P.S. And I'm not even talking about distros which have not migrated to systemd (and will not).
>> 
>> Are there any?  Even debian and ubuntu have raised the white flag.
> 
> That is a digression, of course, but potentially it can be any Unix-like system.
> 
> 
>>>  Do not be offended! We do the same.
>>>  We build a secret military factory,
>>>  put a large concrete fence around it,
>>>  top the wall with barbed wire, but forget to install the gates. :)
>>>>>>>    And then I can kill any process of the third group. They do not restart.
>>>>>>   Until they become needed.
>>>>>>   Eg. if the DC goes to invoke the policy engine, that will fail causing the crmd to fail and the node to be fenced.
>>>>>>>    Generally, don't touch corosync, cib and maybe lrmd, crmd.
>>>>>>> 
>>>>>>>    What do you think about this?
>>>>>>>    The main question of this topic is resolved.
>>>>>>>    But this varied behavior is another big problem.
>>>>>>> 
>>>>>>>    17.02.2014, 08:52, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>    17.02.2014, 02:27, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>     With no quick follow-up, dare one hope that means the patch worked? :-)
>>>>>>>>    Hi,
>>>>>>>>    No, unfortunately my boss changed my plans on Friday and I spent the whole day on a parallel project.
>>>>>>>>    I hope to have time today to carry out the necessary tests.
>>>>>>>>>     On 14 Feb 2014, at 3:37 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>      Yes, of course. Starting to build everything and test now )
>>>>>>>>>> 
>>>>>>>>>>      14.02.2014, 04:41, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>      The previous patch wasn't quite right.
>>>>>>>>>>>      Could you try this new one?
>>>>>>>>>>> 
>>>>>>>>>>>         http://paste.fedoraproject.org/77123/13923376/
>>>>>>>>>>> 
>>>>>>>>>>>      [11:23 AM] beekhof at f19 ~/Development/sources/pacemaker/devel ☺ # git diff
>>>>>>>>>>>      diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>>>>>>>>      index ac4b905..d49525b 100644
>>>>>>>>>>>      --- a/crmd/callbacks.c
>>>>>>>>>>>      +++ b/crmd/callbacks.c
>>>>>>>>>>>      @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>>>>>>>>>>                       stop_te_timer(down->timer);
>>>>>>>>>>> 
>>>>>>>>>>>                       flags |= node_update_join | node_update_expected;
>>>>>>>>>>>      -                crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>>>>>>>>>      -                crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>      +                crmd_peer_down(node, FALSE);
>>>>>>>>>>>                       check_join_state(fsa_state, __FUNCTION__);
>>>>>>>>>>> 
>>>>>>>>>>>                       update_graph(transition_graph, down);
>>>>>>>>>>>      diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>>>>>>>>>      index bc472c2..1a2577a 100644
>>>>>>>>>>>      --- a/crmd/crmd_utils.h
>>>>>>>>>>>      +++ b/crmd/crmd_utils.h
>>>>>>>>>>>      @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>>>>>>>>>       const char *get_timer_desc(fsa_timer_t * timer);
>>>>>>>>>>>       gboolean too_many_st_failures(void);
>>>>>>>>>>>       void st_fail_count_reset(const char * target);
>>>>>>>>>>>      +void crmd_peer_down(crm_node_t *peer, bool full);
>>>>>>>>>>> 
>>>>>>>>>>>       #  define fsa_register_cib_callback(id, flag, data, fn) do {              \
>>>>>>>>>>>               fsa_cib_conn->cmds->register_callback(                          \
>>>>>>>>>>>      diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>>>>>>>>>      index f31d4ec..3bfce59 100644
>>>>>>>>>>>      --- a/crmd/te_actions.c
>>>>>>>>>>>      +++ b/crmd/te_actions.c
>>>>>>>>>>>      @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char *target, const char *uuid)
>>>>>>>>>>>               crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>>>>>>>>>               peer->uuid = strdup(uuid);
>>>>>>>>>>>           }
>>>>>>>>>>>      -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>      -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>      -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>      -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>> 
>>>>>>>>>>>      +    crmd_peer_down(peer, TRUE);
>>>>>>>>>>>           node_state =
>>>>>>>>>>>               do_update_node_cib(peer,
>>>>>>>>>>>                                  node_update_cluster | node_update_peer | node_update_join |
>>>>>>>>>>>      diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>>>>>>>>>>>      index ad7e573..0c92e95 100644
>>>>>>>>>>>      --- a/crmd/te_utils.c
>>>>>>>>>>>      +++ b/crmd/te_utils.c
>>>>>>>>>>>      @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, stonith_event_t * st_event)
>>>>>>>>>>> 
>>>>>>>>>>>               }
>>>>>>>>>>> 
>>>>>>>>>>>      -        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>      -        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>      -        crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>      -        crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>>      +        crmd_peer_down(peer, TRUE);
>>>>>>>>>>>            }
>>>>>>>>>>>       }
>>>>>>>>>>> 
>>>>>>>>>>>      diff --git a/crmd/utils.c b/crmd/utils.c
>>>>>>>>>>>      index 3988cfe..2df53ab 100644
>>>>>>>>>>>      --- a/crmd/utils.c
>>>>>>>>>>>      +++ b/crmd/utils.c
>>>>>>>>>>>      @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, const char *user_name)
>>>>>>>>>>>           crm_trace("telling attrd to clear attributes for remote host %s", host);
>>>>>>>>>>>           update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>>>>>>>>>>>       }
>>>>>>>>>>>      +
>>>>>>>>>>>      +void crmd_peer_down(crm_node_t *peer, bool full)
>>>>>>>>>>>      +{
>>>>>>>>>>>      +    if(full && peer->state == NULL) {
>>>>>>>>>>>      +        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>      +        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>      +    }
>>>>>>>>>>>      +    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>>      +    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>      +}
>>>>>>>>>>> 
>>>>>>>>>>>      On 16 Jan 2014, at 7:24 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>       16.01.2014, 01:30, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>       On 16 Jan 2014, at 12:41 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>        15.01.2014, 02:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>        On 15 Jan 2014, at 12:15 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>         14.01.2014, 10:00, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>>>>>>>>>         14.01.2014, 07:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>          Ok, here's what happens:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>          1. node2 is lost
>>>>>>>>>>>>>>>>>>          2. fencing of node2 starts
>>>>>>>>>>>>>>>>>>          3. node2 reboots (and cluster starts)
>>>>>>>>>>>>>>>>>>          4. node2 returns to the membership
>>>>>>>>>>>>>>>>>>          5. node2 is marked as a cluster member
>>>>>>>>>>>>>>>>>>          6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>>>>>>>>>>>>             Which is a problem since the node2 fencing operation is part of that
>>>>>>>>>>>>>>>>>>          7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>>>>>>>>>>>>>          8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>          That's the theory, except we automatically try again, which isn't appropriate.
>>>>>>>>>>>>>>>>>>          This should be relatively easy to fix.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>          8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>          This I have no idea how to fix yet.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>          On another note, it doesn't look like this agent works at all.
>>>>>>>>>>>>>>>>>>          The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>>>>>>>>>>>>>          So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>>>>>>>>>>>>>         Damn. Looks like you're right. At some point I broke my agent and did not notice it. Who can tell when.
>>>>>>>>>>>>>>>>         I repaired my agent - after sending the reboot it was waiting on STDIN.
>>>>>>>>>>>>>>>>         The "normal" behavior is back - the node hangs in "pending" until I manually send a reboot. :)
>>>>>>>>>>>>>>>        Right. Now you're in case 8b.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>        Can you try this patch:  http://paste.fedoraproject.org/68450/38973966
>>>>>>>>>>>>>>        I spent the whole day on experiments.
>>>>>>>>>>>>>>        Here is what it comes down to:
>>>>>>>>>>>>>>        1. Built the cluster.
>>>>>>>>>>>>>>        2. On node-2, sent a signal (-4) and killed corosync.
>>>>>>>>>>>>>>        3. From node-1 (the DC), stonith sent a reboot.
>>>>>>>>>>>>>>        4. The node rebooted and the resources started.
>>>>>>>>>>>>>>        5. Again: on node-2, sent a signal (-4) and killed corosync.
>>>>>>>>>>>>>>        6. Again: from node-1 (the DC), stonith sent a reboot.
>>>>>>>>>>>>>>        7. Node-2 rebooted and hangs in "pending".
>>>>>>>>>>>>>>        8. Waited, waited..... then rebooted it manually.
>>>>>>>>>>>>>>        9. Node-2 rebooted and the resources started.
>>>>>>>>>>>>>>        10. GOTO step 2
>>>>>>>>>>>>>       Logs?
>>>>>>>>>>>>       Yesterday I sent a separate email explaining why I had not attached the logs.
>>>>>>>>>>>>       Please read it; it contains a few more questions.
>>>>>>>>>>>>       Today the node began to hang again, following the same cycle.
>>>>>>>>>>>>       Logs are here: http://send2me.ru/crmrep2.tar.bz2
>>>>>>>>>>>>>>>>         New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>>>>>>>>>>>>>          On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>>>>>>>>>>           Apart from anything else, your timeout needs to be bigger:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>           Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>           On 14 Jan 2014, at 7:18 am, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>>>>>>>>>>>           On 13 Jan 2014, at 8:31 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>           13.01.2014, 02:51, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>           On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>           10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>>>>>>>>>>>>>>>>           10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>           On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>            10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>             On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>              On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               Hi, ALL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               I'm still trying to cope with the fact that after the fence - node hangs in "pending".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>              Please define "pending".  Where did you see this?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              In crm_mon:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              ......
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              ......
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              The experiment was like this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              Four nodes in cluster.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              After that, the remaining nodes keep rebooting it under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              Then "Too many failures ...." showed up in the log.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              All this time the status in crm_mon is "pending".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              Now I am in the following state:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              I tried to localize the problem, and this is where I got.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              I set a large value for the property stonith-timeout="600s".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              And got the following behavior:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              1. pkill -4 corosync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              2. The node with the DC calls my fence agent "sshbykey".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>>>>>>>>>>>>>>>>>>>>>>>             Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>>>>>>>>>>>>>>             This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>>>>>>>>>>>>>>>            It was version 1.1.11 from December 3.
>>>>>>>>>>>>>>>>>>>>>>>>>>            I will now do a full update and retest.
>>>>>>>>>>>>>>>>>>>>>>>>>           That should be recent enough.  Can you create a crm_report the next time you reproduce?
>>>>>>>>>>>>>>>>>>>>>>>>           Of course yes. Little delay.... :)
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>           ......
>>>>>>>>>>>>>>>>>>>>>>>>           cc1: warnings being treated as errors
>>>>>>>>>>>>>>>>>>>>>>>>           upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>>>>>>>>>>>>>>           upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>>>>>>>>>>>>>>           upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>>>>>>>>>>>>>>           upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>>>>>>>>>>>>>>           gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>           gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>>>>>>>>>>>>>>           make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>           make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>>>>>>>>>>>>>>           make: *** [core] Error 1
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>           I'm trying to solve this problem.
>>>>>>>>>>>>>>>>>>>>>>>           It is not getting solved quickly...
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>           https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>>>>>>>>>>>>>>           g_variant_lookup_value () Since 2.28
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>           # yum list installed glib2
>>>>>>>>>>>>>>>>>>>>>>>           Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>>>>>>>>>>>>>>           This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>>>>>>>>>>>>>>           Loading mirror speeds from cached hostfile
>>>>>>>>>>>>>>>>>>>>>>>           Installed Packages
>>>>>>>>>>>>>>>>>>>>>>>           glib2.x86_64                                                              2.26.1-3.el6                                                               installed
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>           # cat /etc/issue
>>>>>>>>>>>>>>>>>>>>>>>           CentOS release 6.5 (Final)
>>>>>>>>>>>>>>>>>>>>>>>           Kernel \r on an \m
>>>>>>>>>>>>>>>>>>>>>>           Can you try this patch?
>>>>>>>>>>>>>>>>>>>>>>           Upstart jobs won't work, but the code will compile
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>           diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>           index 831e7cf..195c3a4 100644
>>>>>>>>>>>>>>>>>>>>>>           --- a/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>           +++ b/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>           @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>>>>>>>>>>>>>           static char *
>>>>>>>>>>>>>>>>>>>>>>           upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>>>>>>>>>>>           {
>>>>>>>>>>>>>>>>>>>>>>           +    char *output = NULL;
>>>>>>>>>>>>>>>>>>>>>>           +
>>>>>>>>>>>>>>>>>>>>>>           +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>>>>>>>>>>>>>>           +    static bool err = TRUE;
>>>>>>>>>>>>>>>>>>>>>>           +
>>>>>>>>>>>>>>>>>>>>>>           +    if(err) {
>>>>>>>>>>>>>>>>>>>>>>           +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>>>>>>>>>>>>>>           +        err = FALSE;
>>>>>>>>>>>>>>>>>>>>>>           +    }
>>>>>>>>>>>>>>>>>>>>>>           +#else
>>>>>>>>>>>>>>>>>>>>>>              GError *error = NULL;
>>>>>>>>>>>>>>>>>>>>>>              GDBusProxy *proxy;
>>>>>>>>>>>>>>>>>>>>>>              GVariant *asv = NULL;
>>>>>>>>>>>>>>>>>>>>>>              GVariant *value = NULL;
>>>>>>>>>>>>>>>>>>>>>>              GVariant *_ret = NULL;
>>>>>>>>>>>>>>>>>>>>>>           -    char *output = NULL;
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>              crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>>>>>>>>>>>>>              proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>>>>>>>>>>>>>>           @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>              g_object_unref(proxy);
>>>>>>>>>>>>>>>>>>>>>>              g_variant_unref(_ret);
>>>>>>>>>>>>>>>>>>>>>>           +#endif
>>>>>>>>>>>>>>>>>>>>>>              return output;
>>>>>>>>>>>>>>>>>>>>>>           }
>>>>>>>>>>>>>>>>>>>>>           Ok :) I patched the source.
>>>>>>>>>>>>>>>>>>>>>           Ran "make rc" - the same error.
>>>>>>>>>>>>>>>>>>>>           Because it's not building your local changes
>>>>>>>>>>>>>>>>>>>>>           Made a new copy via "fetch" - the same error.
>>>>>>>>>>>>>>>>>>>>>           It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it is downloaded.
>>>>>>>>>>>>>>>>>>>>>           Otherwise the existing archive is used.
>>>>>>>>>>>>>>>>>>>>>           Trimmed log .......
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>           # make rc
>>>>>>>>>>>>>>>>>>>>>           make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>>>>>>>>>>>>>>           make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>>>>>>>>>>>>>>           rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>>>>>>>>>>>>>>           if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then                                             \
>>>>>>>>>>>>>>>>>>>>>                    rm -f pacemaker.tar.*;                                              \
>>>>>>>>>>>>>>>>>>>>>                    if [ Pacemaker-1.1.11-rc3 = dirty ]; then                                   \
>>>>>>>>>>>>>>>>>>>>>                        git commit -m "DO-NOT-PUSH" -a;                                 \
>>>>>>>>>>>>>>>>>>>>>                        git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;       \
>>>>>>>>>>>>>>>>>>>>>                        git reset --mixed HEAD^;                                        \
>>>>>>>>>>>>>>>>>>>>>                    else                                                                \
>>>>>>>>>>>>>>>>>>>>>                        git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;     \
>>>>>>>>>>>>>>>>>>>>>                    fi;                                                                 \
>>>>>>>>>>>>>>>>>>>>>                    echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                                     \
>>>>>>>>>>>>>>>>>>>>>                else                                                                    \
>>>>>>>>>>>>>>>>>>>>>                    echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                     \
>>>>>>>>>>>>>>>>>>>>>                fi
>>>>>>>>>>>>>>>>>>>>>           Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>>>>>>>>>>>>>>           .......
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>           Well, "make rpm" built the rpms and I created the cluster.
>>>>>>>>>>>>>>>>>>>>>           I ran the same tests and confirmed the behavior.
>>>>>>>>>>>>>>>>>>>>>           crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>>>>>>>>>>>>>>           Thanks!
>>>>>>>>>>>>>>>>>>          _______________________________________________
>>>>>>>>>>>>>>>>>>          Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>>>>>>>>>>>>>          http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>          Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>>>>>>>          Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>>>>>>>          Bugs: http://bugs.clusterlabs.org
