[Pacemaker] node status does not change even if pacemakerd dies
Kazunori INOUE
inouekazu at intellilink.co.jp
Tue Jan 22 10:09:37 UTC 2013
(13.01.10 13:35), Andrew Beekhof wrote:
> On Wed, Jan 9, 2013 at 8:57 PM, Kazunori INOUE
> <inouekazu at intellilink.co.jp> wrote:
>> Hi Andrew,
>>
>> I have another question about this subject.
>> Even if pengine, stonithd, and attrd crash after pacemakerd is killed
>> (for example, by the OOM Killer), the node status does not change.
>>
>> * pseudo testcase
>>
>> [dev1 ~]$ crm configure show
>> node $id="2472913088" dev2
>> node $id="2506467520" dev1
>> primitive prmDummy ocf:pacemaker:Dummy \
>>         op monitor on-fail="restart" interval="10s"
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.8-d20d06f" \
>>         cluster-infrastructure="corosync" \
>>         no-quorum-policy="ignore" \
>>         stonith-enabled="false" \
>>         startup-fencing="false"
>> rsc_defaults $id="rsc-options" \
>>         resource-stickiness="INFINITY" \
>>         migration-threshold="1"
>>
>>
>> [dev1 ~]$ pkill -9 pacemakerd
>> [dev1 ~]$ pkill -9 pengine
>> [dev1 ~]$ pkill -9 stonithd
>> [dev1 ~]$ pkill -9 attrd
>
> From http://linux-mm.org/OOM_Killer
>
> * 2) we recover a large amount of memory
> * 3) we don't kill anything innocent of eating tons of memory
> * 4) we want to kill the minimum amount of processes (one)
>
> pacemakerd doesn't meet any of these criteria and is probably the last
> process that would ever be killed.
> It uses orders of magnitude less memory than corosync and the cib for
> example - so those would be among the first to go.
>
> The order you'd need to kill things to match the OOM killer is:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 20319 root RT 0 409m 85m 58m S 0.0 17.4 0:14.45 corosync
> 20611 hacluste 20 0 115m 19m 17m S 0.0 4.0 0:02.85 pengine
> 20607 hacluste 20 0 97908 12m 9572 S 0.0 2.6 0:03.45 cib
> 20612 root 20 0 151m 11m 9568 S 0.0 2.3 0:03.02 crmd
> 20608 root 20 0 92036 8832 7636 S 0.0 1.8 0:02.22 stonithd
> 20609 root 20 0 73216 3180 2420 S 0.0 0.6 0:02.88 lrmd
> 20610 hacluste 20 0 85868 3120 2356 S 0.0 0.6 0:02.21 attrd
> 20601 root 20 0 80356 2960 2232 S 0.0 0.6 0:02.98 pacemakerd
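(As a quick cross-check of this ranking, the kernel exposes its own score for
each process in /proc/<pid>/oom_score; the loop below is only an illustrative
sketch, not part of Pacemaker, and assumes pgrep and a standard /proc layout:)

  $ for p in $(pgrep -f 'corosync|pacemaker'); do
        # print process name and its oom_score; higher score = killed sooner
        printf "%-12s %s\n" "$(ps -o comm= -p $p)" "$(cat /proc/$p/oom_score)"
    done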
>
>
> So you can't just "kill -9" a specific combination of processes and
> say "OOM Killer" to make it a plausible test case.
> Also, with stonith disabled, this scenario is honestly the least of
> your problems.
>
> HOWEVER...
>
> As long as the cib, lrmd, and crmd are around, the cluster, while
> degraded, is still able to perform its primary functions (start/stop
> processes and do health checks).
> So not sending it offline is reasonable. If you had done this on the
> DC you would have seen a different result.
>
> The question I ask in these cases is, "what do we gain by having
> pacemaker exit?".
> Particularly with stonith turned off, the answer here is worse than nothing...
> At best you have the services running on a node without pacemaker, at
> worst the cluster starts them on the second node as well.
>
> Reporting the node as healthy, however, is clearly not good. We
> absolutely need to mark it as degraded somehow.
>
> David and I talked this morning about potentially putting the node
> automatically into standby (it can still probe for services in that
> state) if certain processes die, as well as ensuring it never wins a DC
> election.
> These are the things I would prefer to invest time into rather than
> always resorting to the exit(1) hammer.
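(For reference, putting a node into standby can already be done by hand today;
a minimal sketch, assuming the crm shell used earlier in this thread and the
node name dev1 -- a standby node still probes resources but will not run them:)

  $ crm node standby dev1    # mark dev1 as standby
  $ crm node online dev1     # return dev1 to normal operation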
>
> Restarting for every error is something that was only ever meant to be
> temporary, note the creation date on:
> https://developerbugs.linuxfoundation.org/show_bug.cgi?id=66
>
Hi Andrew,
I understand now that pacemakerd would not be killed by the OOM Killer.
However, because process failures may still occur under unexpected
circumstances, we let Upstart manage pacemakerd.
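
(A minimal sketch of such an Upstart job, assuming Upstart on RHEL 6 as in the
test below; the "respawn limit" values are only an example, not a
recommendation:)

  # /etc/init/pacemaker.conf
  respawn
  respawn limit 10 300    # give up after 10 respawns within 300 seconds
  script
      [ -f /etc/sysconfig/pacemaker ] && . /etc/sysconfig/pacemaker
      exec /usr/sbin/pacemakerd
  end script

With a respawn limit, a crash-looping pacemakerd is eventually left down
instead of being restarted forever.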
Thanks,
Kazunori INOUE
>>
>> [dev1 ~]$ ps -ef|egrep 'corosync|pacemaker'
>> root 19124 1 0 14:27 ? 00:00:01 corosync
>> 496 19144 1 0 14:27 ? 00:00:00 /usr/libexec/pacemaker/cib
>> root 19146 1 0 14:27 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>> 496 19149 1 0 14:27 ? 00:00:00 /usr/libexec/pacemaker/crmd
>>
>> [dev1 ~]$ crm_mon -1
>>
>> :
>> Stack: corosync
>> Current DC: dev2 (2472913088) - partition with quorum
>> Version: 1.1.8-d20d06f
>>
>> 2 Nodes configured, unknown expected votes
>> 1 Resources configured.
>>
>>
>> Online: [ dev1 dev2 ]
>>
>> prmDummy (ocf::pacemaker:Dummy): Started dev1
>>
>> Node dev1 remains Online.
>> When other processes such as lrmd crash, the node becomes "UNCLEAN (offline)".
>> Is this a bug, or is it the intended behavior?
>>
>> Best Regards,
>> Kazunori INOUE
>>
>>
>>
>> (13.01.08 09:16), Andrew Beekhof wrote:
>>>
>>> On Wed, Dec 19, 2012 at 8:15 PM, Kazunori INOUE
>>> <inouekazu at intellilink.co.jp> wrote:
>>>>
>>>> (12.12.13 08:26), Andrew Beekhof wrote:
>>>>>
>>>>>
>>>>> On Wed, Dec 12, 2012 at 8:02 PM, Kazunori INOUE
>>>>> <inouekazu at intellilink.co.jp> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I recognize that pacemakerd itself is much less likely to crash.
>>>>>> However, the possibility of it being killed by the OOM Killer etc. is not zero.
>>>>>
>>>>>
>>>>>
>>>>> True. Although we just established in another thread that we don't
>>>>> have any leaks :)
>>>>>
>>>>>> So I think users will get confused, since the behavior at the time of a
>>>>>> process death differs even though pacemakerd is running.
>>>>>>
>>>>>> case A)
>>>>>> When pacemakerd and the other processes (crmd etc.) have a parent-child
>>>>>> relationship.
>>>>>>
>>>>>
>>>>> [snip]
>>>>>
>>>>>>
>>>>>> For example, if crmd dies, it is relaunched, so the state of the cluster
>>>>>> is not affected.
>>>>>
>>>>>
>>>>>
>>>>> Right.
>>>>>
>>>>> [snip]
>>>>>
>>>>>>
>>>>>> case B)
>>>>>> When pacemakerd and the other processes do NOT have a parent-child
>>>>>> relationship.
>>>>>> Here pacemakerd was killed and has since been respawned by Upstart.
>>>>>>
>>>>>> $ service corosync start ; service pacemaker start
>>>>>> $ pkill -9 pacemakerd
>>>>>> $ ps -ef|egrep 'corosync|pacemaker|UID'
>>>>>> UID PID PPID C STIME TTY TIME CMD
>>>>>> root 21091 1 1 14:52 ? 00:00:00 corosync
>>>>>> 496 21099 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/cib
>>>>>> root 21100 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>> root 21101 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>> 496 21102 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/attrd
>>>>>> 496 21103 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/pengine
>>>>>> 496 21104 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/crmd
>>>>>> root 21128 1 1 14:53 ? 00:00:00 /usr/sbin/pacemakerd
>>>>>
>>>>>
>>>>>
>>>>> Yep, looks right.
>>>>>
>>>>
>>>> Hi Andrew,
>>>>
>>>> We discussed this behavior and concluded that case B, where pacemakerd
>>>> and the other processes do not have a parent-child relationship, has
>>>> room for improvement.
>>>>
>>>> Since not all users are experts, they may kill pacemakerd accidentally.
>>>> Such a user will get confused if the behavior after a crmd death differs
>>>> between the following cases:
>>>> case A: pacemakerd and the others (crmd etc.) have a parent-child relationship.
>>>> case B: pacemakerd and the others do not have a parent-child relationship.
>>>>
>>>> So we want to *always* get the same behavior as the parent-child case.
>>>> That is, when crmd etc. die, we want pacemaker to always relaunch the
>>>> process immediately.
>>>
>>>
>>> No. Sorry.
>>> Writing features to satisfy an artificial test case is not a good
>>> practice.
>>>
>>> We can speed up the failure detection for case B (I'll agree that 60s
>>> is way too long, 5s or 2s might be better depending on the load is
>>> creates), but causing downtime now to _maybe_ avoid downtime in the
>>> future makes no sense.
>>> Especially when you consider that the node will likely be fenced if
>>> the crmd fails anyway.
>>>
>>> Take a look at the logs from some ComponentFail test runs and you'll
>>> see that the parent-child relationship regularly _fails_ to prevent
>>> downtime.
>>>
>>>>
>>>> Regards,
>>>> Kazunori INOUE
>>>>
>>>>
>>>>>> In this case, the node will be set to UNCLEAN if crmd dies.
>>>>>> That is, the node will be fenced if there is a stonith resource.
>>>>>
>>>>>
>>>>>
>>>>> Which is exactly what happens if only pacemakerd is killed with your
>>>>> proposal.
>>>>> Except now you have time to do a graceful pacemaker restart to
>>>>> re-establish the parent-child relationship.
>>>>>
>>>>> If you want to compare B with something, it needs to be with the old
>>>>> "children terminate if pacemakerd dies" strategy.
>>>>> Which is:
>>>>>
>>>>>> $ service corosync start ; service pacemaker start
>>>>>> $ pkill -9 pacemakerd
>>>>>> ... the node will be set to UNCLEAN
>>>>>
>>>>>
>>>>>
>>>>> Old way: always downtime because children terminate which triggers fencing
>>>>> Our way: no downtime unless there is an additional failure (to the cib or crmd)
>>>>>
>>>>> Given that we're trying for HA, the second seems preferable.
>>>>>
>>>>>>
>>>>>> $ pkill -9 crmd
>>>>>> $ crm_mon -1
>>>>>> Last updated: Wed Dec 12 14:53:48 2012
>>>>>> Last change: Wed Dec 12 14:53:10 2012 via crmd on dev2
>>>>>>
>>>>>> Stack: corosync
>>>>>> Current DC: dev2 (2472913088) - partition with quorum
>>>>>> Version: 1.1.8-3035414
>>>>>>
>>>>>> 2 Nodes configured, unknown expected votes
>>>>>> 0 Resources configured.
>>>>>>
>>>>>> Node dev1 (2506467520): UNCLEAN (online)
>>>>>> Online: [ dev2 ]
>>>>>>
>>>>>>
>>>>>> How about making behavior selectable with an option?
>>>>>
>>>>>
>>>>>
>>>>> MORE_DOWNTIME_PLEASE=(true|false) ?
>>>>>
>>>>>>
>>>>>> When pacemakerd dies,
>>>>>> mode A) behaves in the existing way (default)
>>>>>> mode B) makes the node UNCLEAN
>>>>>>
>>>>>> Best Regards,
>>>>>> Kazunori INOUE
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Making stop work when there is no pacemakerd process is a different
>>>>>>> matter. We can make that work.
>>>>>>>
>>>>>>>>
>>>>>>>> Though the best solution is to relaunch pacemakerd, if that is
>>>>>>>> difficult, I think a simpler alternative is to make the node unclean.
>>>>>>>>
>>>>>>>>
>>>>>>>> And now, I tried Upstart a little bit.
>>>>>>>>
>>>>>>>> 1) started the corosync and pacemaker.
>>>>>>>>
>>>>>>>> $ cat /etc/init/pacemaker.conf
>>>>>>>> respawn
>>>>>>>> script
>>>>>>>>     [ -f /etc/sysconfig/pacemaker ] && {
>>>>>>>>         . /etc/sysconfig/pacemaker
>>>>>>>>     }
>>>>>>>>     exec /usr/sbin/pacemakerd
>>>>>>>> end script
>>>>>>>>
>>>>>>>> $ service co start
>>>>>>>> Starting Corosync Cluster Engine (corosync): [  OK  ]
>>>>>>>> $ initctl start pacemaker
>>>>>>>> pacemaker start/running, process 4702
>>>>>>>>
>>>>>>>>
>>>>>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>>>>>> root 4695 1 0 17:21 ? 00:00:00 corosync
>>>>>>>> root 4702 1 0 17:21 ? 00:00:00 /usr/sbin/pacemakerd
>>>>>>>> 496 4703 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
>>>>>>>> root 4704 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>>> root 4705 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>>> 496 4706 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>>> 496 4707 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>>> 496 4708 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/crmd
>>>>>>>>
>>>>>>>> 2) killed pacemakerd.
>>>>>>>>
>>>>>>>> $ pkill -9 pacemakerd
>>>>>>>>
>>>>>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>>>>>> root 4695 1 0 17:21 ? 00:00:01 corosync
>>>>>>>> 496 4703 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
>>>>>>>> root 4704 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>>> root 4705 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>>> 496 4706 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>>> 496 4707 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>>> 496 4708 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/crmd
>>>>>>>> root 4760 1 1 17:24 ? 00:00:00 /usr/sbin/pacemakerd
>>>>>>>>
>>>>>>>> 3) Then I stopped pacemakerd; however, some processes did not stop.
>>>>>>>>
>>>>>>>> $ initctl stop pacemaker
>>>>>>>> pacemaker stop/waiting
>>>>>>>>
>>>>>>>>
>>>>>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>>>>>> root 4695 1 0 17:21 ? 00:00:01 corosync
>>>>>>>> 496 4703 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
>>>>>>>> root 4704 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>>> root 4705 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>>> 496 4706 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>>> 496 4707 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
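(In that situation the orphaned daemons have to be cleaned up by hand; a rough
sketch, assuming no other Pacemaker instance should be running on the node:)

  $ pkill -TERM -f /usr/libexec/pacemaker   # ask the leftover daemons to exit
  $ ps -ef | egrep 'corosync|pacemaker'     # verify that only corosync remains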
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Kazunori INOUE
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> This isn't the case when the plugin is in use though, but then I'd
>>>>>>>>>>> also have expected most of the processes to die.
>>>>>>>>>>>
>>>>>>>>>> Since the node status would also change in that case,
>>>>>>>>>> that is the behavior we would like.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ----
>>>>>>>>>>>> $ cat /etc/redhat-release
>>>>>>>>>>>> Red Hat Enterprise Linux Server release 6.3 (Santiago)
>>>>>>>>>>>>
>>>>>>>>>>>> $ ./configure --sysconfdir=/etc --localstatedir=/var --without-cman --without-heartbeat
>>>>>>>>>>>> -snip-
>>>>>>>>>>>> pacemaker configuration:
>>>>>>>>>>>> Version = 1.1.8 (Build: 9c13d14)
>>>>>>>>>>>> Features = generated-manpages agent-manpages ascii-docs publican-docs
>>>>>>>>>>>> ncurses libqb-logging libqb-ipc lha-fencing corosync-native snmp
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> $ cat config.log
>>>>>>>>>>>> -snip-
>>>>>>>>>>>> 6000 | #define BUILD_VERSION "9c13d14"
>>>>>>>>>>>> 6001 | /* end confdefs.h. */
>>>>>>>>>>>> 6002 | #include <gio/gio.h>
>>>>>>>>>>>> 6003 |
>>>>>>>>>>>> 6004 | int
>>>>>>>>>>>> 6005 | main ()
>>>>>>>>>>>> 6006 | {
>>>>>>>>>>>> 6007 | if (sizeof (GDBusProxy))
>>>>>>>>>>>> 6008 | return 0;
>>>>>>>>>>>> 6009 | ;
>>>>>>>>>>>> 6010 | return 0;
>>>>>>>>>>>> 6011 | }
>>>>>>>>>>>> 6012 configure:32411: result: no
>>>>>>>>>>>> 6013 configure:32417: WARNING: Unable to support systemd/upstart. You need to use glib >= 2.26
>>>>>>>>>>>> -snip-
>>>>>>>>>>>> 6286 | #define BUILD_VERSION "9c13d14"
>>>>>>>>>>>> 6287 | #define SUPPORT_UPSTART 0
>>>>>>>>>>>> 6288 | #define SUPPORT_SYSTEMD 0
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>> Kazunori INOUE
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> related bugzilla:
>>>>>>>>>>>>>> http://bugs.clusterlabs.org/show_bug.cgi?id=5064
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Kazunori INOUE
>>>>>>>>>>>>>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org