[Pacemaker] node status does not change even if pacemakerd dies
Andrew Beekhof
andrew at beekhof.net
Thu Jan 10 04:35:57 UTC 2013
On Wed, Jan 9, 2013 at 8:57 PM, Kazunori INOUE
<inouekazu at intellilink.co.jp> wrote:
> Hi Andrew,
>
> I have another question about this subject.
> Even if pengine, stonithd, and attrd crash after pacemakerd is killed
> (for example, killed by OOM_Killer), node status does not change.
>
> * pseudo testcase
>
> [dev1 ~]$ crm configure show
> node $id="2472913088" dev2
> node $id="2506467520" dev1
> primitive prmDummy ocf:pacemaker:Dummy \
>         op monitor on-fail="restart" interval="10s"
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.8-d20d06f" \
>         cluster-infrastructure="corosync" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         startup-fencing="false"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="INFINITY" \
>         migration-threshold="1"
>
>
> [dev1 ~]$ pkill -9 pacemakerd
> [dev1 ~]$ pkill -9 pengine
> [dev1 ~]$ pkill -9 stonithd
> [dev1 ~]$ pkill -9 attrd
From http://linux-mm.org/OOM_Killer:
* 2) we recover a large amount of memory
* 3) we don't kill anything innocent of eating tons of memory
* 4) we want to kill the minimum amount of processes (one)
pacemakerd doesn't meet any of these criteria and is probably the last
process that would ever be killed.
It uses orders of magnitude less memory than corosync and the cib for
example - so those would be among the first to go.
The order you'd need to kill things to match the OOM killer is:
  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
20319 root     RT   0  409m  85m  58m S  0.0 17.4  0:14.45 corosync
20611 hacluste 20   0  115m  19m  17m S  0.0  4.0  0:02.85 pengine
20607 hacluste 20   0 97908  12m 9572 S  0.0  2.6  0:03.45 cib
20612 root     20   0  151m  11m 9568 S  0.0  2.3  0:03.02 crmd
20608 root     20   0 92036 8832 7636 S  0.0  1.8  0:02.22 stonithd
20609 root     20   0 73216 3180 2420 S  0.0  0.6  0:02.88 lrmd
20610 hacluste 20   0 85868 3120 2356 S  0.0  0.6  0:02.21 attrd
20601 root     20   0 80356 2960 2232 S  0.0  0.6  0:02.98 pacemakerd
So you can't just "kill -9" a specific combination of processes and
say "OOM Killer" to make it a plausible test case.
Also, with stonith disabled, this scenario is honestly the least of
your problems.
HOWEVER...
As long as the cib, lrmd, and crmd are around, the cluster, while
degraded, is still able to perform its primary functions (start/stop
processes and do health checks).
So not sending it offline is reasonable. If you had done this on the
DC you would have seen a different result.
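Both halves of that are easy to check from the shell before picking a
victim; a quick sketch with the standard CLI tools:

$ cibadmin -Q >/dev/null && echo "cib OK"   # the cib still answers queries
$ crm_resource -W -r prmDummy               # the crmd/lrmd can still locate the resource
$ crmadmin -D                               # prints the uname of the current DC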
The question I ask in these cases is, "what do we gain by having
pacemaker exit?".
Particularly with stonith turned off, the answer here is worse than nothing...
At best you have the services running on a node without pacemaker, at
worst the cluster starts them on the second node as well.
Reporting the node as healthy, however, is clearly not good. We
absolutely need to mark it as degraded somehow.
David and I talked this morning about potentially putting the node
automatically into standby (it can still probe for services in that
state) if certain processes die, as well as ensuring it never wins a DC
election.
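Incidentally, standby is just a node attribute, so an admin who notices
the dead daemons can already get that effect by hand today. A sketch
using crm_attribute (crmsh's "crm node standby" does the same thing):

$ crm_attribute -N dev1 -n standby -v on    # resources move off, probes still run
$ crm_attribute -N dev1 -n standby -v off   # clear it after a clean pacemaker restart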
These are the things I would prefer to invest time into rather than
always resorting to the exit(1) hammer.
Restarting for every error is something that was only ever meant to be
temporary; note the creation date on:
https://developerbugs.linuxfoundation.org/show_bug.cgi?id=66
>
> [dev1 ~]$ ps -ef|egrep 'corosync|pacemaker'
> root 19124 1 0 14:27 ? 00:00:01 corosync
> 496 19144 1 0 14:27 ? 00:00:00 /usr/libexec/pacemaker/cib
> root 19146 1 0 14:27 ? 00:00:00 /usr/libexec/pacemaker/lrmd
> 496 19149 1 0 14:27 ? 00:00:00 /usr/libexec/pacemaker/crmd
>
> [dev1 ~]$ crm_mon -1
>
> :
> Stack: corosync
> Current DC: dev2 (2472913088) - partition with quorum
> Version: 1.1.8-d20d06f
>
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
>
>
> Online: [ dev1 dev2 ]
>
> prmDummy (ocf::pacemaker:Dummy): Started dev1
>
> Node (dev1) remains Online.
> When other processes such as lrmd crash, the node becomes "UNCLEAN (offline)".
> Is this a bug, or is it the intended behavior?
>
> Best Regards,
> Kazunori INOUE
>
>
>
> (13.01.08 09:16), Andrew Beekhof wrote:
>>
>> On Wed, Dec 19, 2012 at 8:15 PM, Kazunori INOUE
>> <inouekazu at intellilink.co.jp> wrote:
>>>
>>> (12.12.13 08:26), Andrew Beekhof wrote:
>>>>
>>>>
>>>> On Wed, Dec 12, 2012 at 8:02 PM, Kazunori INOUE
>>>> <inouekazu at intellilink.co.jp> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I recognize that pacemakerd is much less likely to crash.
>>>>> However, the chance of it being killed by the OOM killer etc. is not zero.
>>>>
>>>>
>>>>
>>>> True. Although we just established in another thread that we don't
>>>> have any leaks :)
>>>>
>>>>> So I think users get confused, since the behavior at the time of
>>>>> process death differs even while pacemakerd is running.
>>>>>
>>>>> case A)
>>>>> When pacemakerd and the other processes (crmd etc.) have a parent-child
>>>>> relationship.
>>>>>
>>>>
>>>> [snip]
>>>>
>>>>>
>>>>> For example, crmd died.
>>>>> However, since it is relaunched, the state of the cluster is not
>>>>> affected.
>>>>
>>>>
>>>>
>>>> Right.
>>>>
>>>> [snip]
>>>>
>>>>>
>>>>> case B)
>>>>> When pacemakerd and the other processes do NOT have a parent-child
>>>>> relationship.
>>>>> pacemakerd was killed, but assume it was then respawned by Upstart.
>>>>>
>>>>> $ service corosync start ; service pacemaker start
>>>>> $ pkill -9 pacemakerd
>>>>> $ ps -ef|egrep 'corosync|pacemaker|UID'
>>>>> UID        PID  PPID  C STIME TTY      TIME     CMD
>>>>> root     21091     1  1 14:52 ?        00:00:00 corosync
>>>>> 496      21099     1  0 14:52 ?        00:00:00 /usr/libexec/pacemaker/cib
>>>>> root     21100     1  0 14:52 ?        00:00:00 /usr/libexec/pacemaker/stonithd
>>>>> root     21101     1  0 14:52 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>>>>> 496      21102     1  0 14:52 ?        00:00:00 /usr/libexec/pacemaker/attrd
>>>>> 496      21103     1  0 14:52 ?        00:00:00 /usr/libexec/pacemaker/pengine
>>>>> 496      21104     1  0 14:52 ?        00:00:00 /usr/libexec/pacemaker/crmd
>>>>> root     21128     1  1 14:53 ?        00:00:00 /usr/sbin/pacemakerd
>>>>
>>>>
>>>>
>>>> Yep, looks right.
>>>>
>>>
>>> Hi Andrew,
>>>
>>> We discussed this behavior and concluded that the case where
>>> pacemakerd and the other processes do not have a parent-child
>>> relationship (case B) has room for improvement.
>>>
>>> Since not all users are experts, they may kill pacemakerd accidentally.
>>> Such a user will be confused if the behavior after a crmd death differs
>>> between the following cases:
>>> case A: pacemakerd and the others (crmd etc.) have a parent-child relationship.
>>> case B: pacemakerd and the others do not have a parent-child relationship.
>>>
>>> So, we want to *always* obtain the same behavior as the case where
>>> there is a parent-child relationship.
>>> That is, when crmd etc. die, we want pacemaker to always relaunch
>>> the process immediately.
>>
>>
>> No. Sorry.
>> Writing features to satisfy an artificial test case is not a good
>> practice.
>>
>> We can speed up the failure detection for case B (I'll agree that 60s
>> is way too long, 5s or 2s might be better depending on the load is
>> creates), but causing downtime now to _maybe_ avoid downtime in the
>> future makes no sense.
>> Especially when you consider that the node will likely be fenced if
>> the crmd fails anyway.
>>
>> Take a look at the logs from some ComponentFail test runs and you'll
>> see that the parent-child relationship regularly _fails_ to prevent
>> downtime.
>>
>>>
>>> Regards,
>>> Kazunori INOUE
>>>
>>>
>>>>> In this case, the node will be set to UNCLEAN if crmd dies.
>>>>> That is, the node will be fenced if there is a stonith resource.
>>>>
>>>>
>>>>
>>>> Which is exactly what happens if only pacemakerd is killed with your
>>>> proposal.
>>>> Except now you have time to do a graceful pacemaker restart to
>>>> re-establish the parent-child relationship.
>>>>
>>>> If you want to compare B with something, it needs to be with the old
>>>> "children terminate if pacemakerd dies" strategy.
>>>> Which is:
>>>>
>>>>> $ service corosync start ; service pacemaker start
>>>>> $ pkill -9 pacemakerd
>>>>> ... the node will be set to UNCLEAN
>>>>
>>>>
>>>>
>>>> Old way: always downtime, because children terminate, which triggers fencing
>>>> Our way: no downtime unless there is an additional failure (to the cib or crmd)
>>>>
>>>> Given that we're trying for HA, the second seems preferable.
>>>>
>>>>>
>>>>> $ pkill -9 crmd
>>>>> $ crm_mon -1
>>>>> Last updated: Wed Dec 12 14:53:48 2012
>>>>> Last change: Wed Dec 12 14:53:10 2012 via crmd on dev2
>>>>>
>>>>> Stack: corosync
>>>>> Current DC: dev2 (2472913088) - partition with quorum
>>>>> Version: 1.1.8-3035414
>>>>>
>>>>> 2 Nodes configured, unknown expected votes
>>>>> 0 Resources configured.
>>>>>
>>>>> Node dev1 (2506467520): UNCLEAN (online)
>>>>> Online: [ dev2 ]
>>>>>
>>>>>
>>>>> How about making behavior selectable with an option?
>>>>
>>>>
>>>>
>>>> MORE_DOWNTIME_PLEASE=(true|false) ?
>>>>
>>>>>
>>>>> When pacemakerd dies,
>>>>> mode A) behave in the existing way (default)
>>>>> mode B) make the node UNCLEAN
>>>>>
>>>>> Best Regards,
>>>>> Kazunori INOUE
>>>>>
>>>>>
>>>>>
>>>>>> Making stop work when there is no pacemakerd process is a different
>>>>>> matter. We can make that work.
>>>>>>
>>>>>>>
>>>>>>> Though the best solution is to relaunch pacemakerd, if that is difficult,
>>>>>>> I think a shortcut would be to make the node unclean.
>>>>>>>
>>>>>>>
>>>>>>> Now, I have tried Upstart a little bit.
>>>>>>>
>>>>>>> 1) Started corosync and pacemaker.
>>>>>>>
>>>>>>> $ cat /etc/init/pacemaker.conf
>>>>>>> respawn
>>>>>>> script
>>>>>>>     [ -f /etc/sysconfig/pacemaker ] && {
>>>>>>>         . /etc/sysconfig/pacemaker
>>>>>>>     }
>>>>>>>     exec /usr/sbin/pacemakerd
>>>>>>> end script
>>>>>>>
>>>>>>> $ service co start
>>>>>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>>>>>> $ initctl start pacemaker
>>>>>>> pacemaker start/running, process 4702
>>>>>>>
>>>>>>>
>>>>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>>>>> root      4695     1  0 17:21 ?        00:00:00 corosync
>>>>>>> root      4702     1  0 17:21 ?        00:00:00 /usr/sbin/pacemakerd
>>>>>>> 496       4703  4702  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/cib
>>>>>>> root      4704  4702  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>> root      4705  4702  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>> 496       4706  4702  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>> 496       4707  4702  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>> 496       4708  4702  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/crmd
>>>>>>>
>>>>>>> 2) Killed pacemakerd.
>>>>>>>
>>>>>>> $ pkill -9 pacemakerd
>>>>>>>
>>>>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>>>>> root      4695     1  0 17:21 ?        00:00:01 corosync
>>>>>>> 496       4703     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/cib
>>>>>>> root      4704     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>> root      4705     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>> 496       4706     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>> 496       4707     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>> 496       4708     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/crmd
>>>>>>> root      4760     1  1 17:24 ?        00:00:00 /usr/sbin/pacemakerd
>>>>>>>
>>>>>>> 3) Then I stopped pacemakerd; however, some processes did not stop.
>>>>>>>
>>>>>>> $ initctl stop pacemaker
>>>>>>> pacemaker stop/waiting
>>>>>>>
>>>>>>>
>>>>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>>>>> root      4695     1  0 17:21 ?        00:00:01 corosync
>>>>>>> 496       4703     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/cib
>>>>>>> root      4704     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>> root      4705     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>> 496       4706     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>> 496       4707     1  0 17:21 ?        00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Kazunori INOUE
>>>>>>>
>>>>>>>
>>>>>>>>>> This isn't the case when the plugin is in use though, but then
>>>>>>>>>> I'd also have expected most of the processes to die.
>>>>>>>>>>
>>>>>>>>> Since the node status would also change if that happened,
>>>>>>>>> that is the behavior we would like.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>> $ cat /etc/redhat-release
>>>>>>>>>>> Red Hat Enterprise Linux Server release 6.3 (Santiago)
>>>>>>>>>>>
>>>>>>>>>>> $ ./configure --sysconfdir=/etc --localstatedir=/var --without-cman --without-heartbeat
>>>>>>>>>>> -snip-
>>>>>>>>>>> pacemaker configuration:
>>>>>>>>>>>   Version  = 1.1.8 (Build: 9c13d14)
>>>>>>>>>>>   Features = generated-manpages agent-manpages ascii-docs
>>>>>>>>>>>              publican-docs ncurses libqb-logging libqb-ipc
>>>>>>>>>>>              lha-fencing corosync-native snmp
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> $ cat config.log
>>>>>>>>>>> -snip-
>>>>>>>>>>> 6000 | #define BUILD_VERSION "9c13d14"
>>>>>>>>>>> 6001 | /* end confdefs.h. */
>>>>>>>>>>> 6002 | #include <gio/gio.h>
>>>>>>>>>>> 6003 |
>>>>>>>>>>> 6004 | int
>>>>>>>>>>> 6005 | main ()
>>>>>>>>>>> 6006 | {
>>>>>>>>>>> 6007 | if (sizeof (GDBusProxy))
>>>>>>>>>>> 6008 | return 0;
>>>>>>>>>>> 6009 | ;
>>>>>>>>>>> 6010 | return 0;
>>>>>>>>>>> 6011 | }
>>>>>>>>>>> 6012 configure:32411: result: no
>>>>>>>>>>> 6013 configure:32417: WARNING: Unable to support systemd/upstart. You need to use glib >= 2.26
>>>>>>>>>>> -snip-
>>>>>>>>>>> 6286 | #define BUILD_VERSION "9c13d14"
>>>>>>>>>>> 6287 | #define SUPPORT_UPSTART 0
>>>>>>>>>>> 6288 | #define SUPPORT_SYSTEMD 0
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Kazunori INOUE
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> related bugzilla:
>>>>>>>>>>>>> http://bugs.clusterlabs.org/show_bug.cgi?id=5064
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Kazunori INOUE
>>>>>>>>>>>>>