[Pacemaker] node status does not change even if pacemakerd dies

Wed Dec 5 09:32:18 UTC 2012

(12.12.05 02:02), David Vossel wrote:
>
>
> ----- Original Message -----
>> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Monday, December 3, 2012 11:41:56 PM
>> Subject: Re: [Pacemaker] node status does not change even if pacemakerd dies
>>
>> (12.12.03 20:24), Andrew Beekhof wrote:
>>> On Mon, Dec 3, 2012 at 8:15 PM, Kazunori INOUE
>>> <inouekazu at intellilink.co.jp> wrote:
>>>> (12.11.30 23:52), David Vossel wrote:
>>>>>
>>>>> ----- Original Message -----
>>>>>>
>>>>>> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
>>>>>> To: "pacemaker at oss" <pacemaker at oss.clusterlabs.org>
>>>>>> Sent: Friday, November 30, 2012 2:38:50 AM
>>>>>> Subject: [Pacemaker] node status does not change even if pacemakerd dies
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am testing the latest version.
>>>>>> - ClusterLabs/pacemaker  9c13d14640(Nov 27, 2012)
>>>>>> - corosync               92e0f9c7bb(Nov 07, 2012)
>>>>>> - libqb                  30a7871646(Nov 29, 2012)
>>>>>>
>>>>>>
>>>>>> Although I killed pacemakerd, node status did not change.
>>>>>>
>>>>>>     [dev1 ~]$ pkill -9 pacemakerd
>>>>>>     [dev1 ~]$ crm_mon
>>>>>>       :
>>>>>>     Stack: corosync
>>>>>>     Current DC: dev2 (2472913088) - partition with quorum
>>>>>>     Version: 1.1.8-9c13d14
>>>>>>     2 Nodes configured, unknown expected votes
>>>>>>     0 Resources configured.
>>>>>>
>>>>>>
>>>>>>     Online: [ dev1 dev2 ]
>>>>>>
>>>>>>     [dev1 ~]$ ps -ef|egrep 'corosync|pacemaker'
>>>>>>     root     11990     1  1 16:05 ?        00:00:00 corosync
>>>>>>     496      12010     1  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/cib
>>>>>>     root     12011     1  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/stonithd
>>>>>>     root     12012     1  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>>>>>>     496      12013     1  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/attrd
>>>>>>     496      12014     1  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/pengine
>>>>>>     496      12015     1  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/crmd
>>>>>>
>>>>>>
>>>>>> We want the node status to change to OFFLINE (with
>>>>>> stonith-enabled=false) or UNCLEAN (with stonith-enabled=true).
>>>>>> That is, we want the behavior that this deleted code provided.
>>>>>>
>>>>>> https://github.com/ClusterLabs/pacemaker/commit/dfdfb6c9087e644cb898143e198b240eb9a928b4
>>>>>
>>>>>
>>>>> How are you launching pacemakerd?  The systemd service script
>>>>> relaunches pacemakerd on failure and pacemakerd has the ability to
>>>>> attach to all the old processes if they are still around as if
>>>>> nothing happened.
>>>>>
>>>>> -- Vossel
>>>>>
>>>>
>>>> Hi David,
>>>>
>>>> We are using RHEL6 and will keep using it for a while.
>>>> Therefore, I start it with one of the following commands.
>>>>
>>>> $ /etc/init.d/pacemakerd start
>>>> or
>>>> $ service pacemaker start
>>>
>>> Ok.
>>> Are you using the pacemaker plugin?
>>>
>>> When using cman or corosync 2.0, pacemakerd isn't strictly needed for
>>> normal operation.
>>> It's only there to shut down and/or respawn failed components.
>>>
>> We are using corosync 2.1,
>> so the service does not stop properly after pacemakerd has died.
>>
>>    $ pkill -9 pacemakerd
>>    $ service pacemaker stop
>>    $ echo $?
>>    0
>>    $ ps -ef|egrep 'corosync|pacemaker'
>>    root      3807     1  0 13:10 ?        00:00:00 corosync
>>    496       3827     1  0 13:10 ?        00:00:00 /usr/libexec/pacemaker/cib
>>    root      3828     1  0 13:10 ?        00:00:00 /usr/libexec/pacemaker/stonithd
>>    root      3829     1  0 13:10 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>>    496       3830     1  0 13:10 ?        00:00:00 /usr/libexec/pacemaker/attrd
>>    496       3831     1  0 13:10 ?        00:00:00 /usr/libexec/pacemaker/pengine
>>    496       3832     1  0 13:10 ?        00:00:00 /usr/libexec/pacemaker/crmd
>
> Ah yes, that is a problem.
>
> Having pacemaker still running when the init script says it is down... that is bad.  Perhaps we should just make the init script smart enough to check to make sure all the pacemaker components are down after pacemakerd is down.
>
> Whether the cluster should be alerted to a pacemakerd failure is something I'm not sure about.  With the corosync 2.0 stack, pacemakerd really doesn't do anything except launch and relaunch processes.  A cluster can be completely functional without a pacemakerd instance running anywhere.  If any of the actual pacemaker components on a node fail, the logic that causes that node to get fenced has nothing to do with pacemakerd.
>
> -- Vossel
>
>

Hi,

I think pacemakerd's ability to relaunch failed processes is very useful,
so I want to avoid having resources managed on a node where pacemakerd is not running.

The best solution would be to relaunch pacemakerd, but if that is difficult,
I think a simpler workaround is to mark the node unclean.
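
To make the condition concrete, here is a minimal sketch of how the
"pacemakerd is gone but its children are still running" state could be
detected locally. It is only an illustration: the function name
pacemakerd_missing is hypothetical, the component names are simply those
visible in the ps output quoted above, and what to do on detection
(relaunch pacemakerd or mark the node unclean) is deliberately left out.

```shell
#!/bin/sh
# Hypothetical check: given command names on stdin (one per line), succeed
# when pacemaker components are present but pacemakerd itself is not.
pacemakerd_missing() {
    names=$(cat)
    if printf '%s\n' "$names" | grep -q -x 'pacemakerd'; then
        return 1    # pacemakerd is alive, nothing to report
    fi
    # Component names taken from the ps output earlier in this thread
    # (an assumption; other versions may use different names).
    printf '%s\n' "$names" | grep -q -E -x 'cib|stonithd|lrmd|attrd|pengine|crmd'
}

# Usage on a live node (not run here):
#   if ps -e -o comm= | pacemakerd_missing; then
#       echo "pacemakerd is gone but its children are still running" >&2
#   fi
```

Reading the names from stdin keeps the check independent of how the
process list is obtained, so it can be tried against saved ps output as well.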

I also experimented a little with Upstart.

1) Started corosync and pacemaker.

  $ cat /etc/init/pacemaker.conf
  respawn
  script
      [ -f /etc/sysconfig/pacemaker ] && {
          . /etc/sysconfig/pacemaker
      }
      exec /usr/sbin/pacemakerd
  end script

  $ service co start
  Starting Corosync Cluster Engine (corosync):               [  OK  ]
  $ initctl start pacemaker
  pacemaker start/running, process 4702

  $ ps -ef|egrep 'corosync|pacemaker'
  root   4695     1  0 17:21 ?    00:00:00 corosync
  root   4702     1  0 17:21 ?    00:00:00 /usr/sbin/pacemakerd
  496    4703  4702  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/cib
  root   4704  4702  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/stonithd
  root   4705  4702  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/lrmd
  496    4706  4702  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/attrd
  496    4707  4702  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/pengine
  496    4708  4702  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/crmd

2) Killed pacemakerd. Upstart respawned it (new PID 4760), but the old child processes were reparented to init (PPID 1).

  $ pkill -9 pacemakerd
  $ ps -ef|egrep 'corosync|pacemaker'
  root   4695     1  0 17:21 ?    00:00:01 corosync
  496    4703     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/cib
  root   4704     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/stonithd
  root   4705     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/lrmd
  496    4706     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/attrd
  496    4707     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/pengine
  496    4708     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/crmd
  root   4760     1  1 17:24 ?    00:00:00 /usr/sbin/pacemakerd

3) Then I stopped pacemaker with initctl. However, only pacemakerd and crmd went away; the other child processes did not stop.

  $ initctl stop pacemaker
  pacemaker stop/waiting

  $ ps -ef|egrep 'corosync|pacemaker'
  root   4695     1  0 17:21 ?    00:00:01 corosync
  496    4703     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/cib
  root   4704     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/stonithd
  root   4705     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/lrmd
  496    4706     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/attrd
  496    4707     1  0 17:21 ?    00:00:00 /usr/libexec/pacemaker/pengine
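
As an aside, the check David suggested (confirming that every pacemaker
component is down once pacemakerd is down) could be sketched roughly as
below. This is an assumption-laden illustration, not the init script's
actual logic: the function names leftover_components and
all_components_down are hypothetical, and the daemon list is just the set
of names seen in the ps output above.

```shell
#!/bin/sh
# Hypothetical helper: print any pacemaker child daemons found in a list of
# command names (one per line on stdin). The names are those seen in the ps
# output above; adjust them for other versions.
leftover_components() {
    grep -E -x 'cib|stonithd|lrmd|attrd|pengine|crmd'
}

# Hypothetical helper: succeed only if no component is left running.
all_components_down() {
    left=$(leftover_components || true)
    if [ -n "$left" ]; then
        echo "still running: $(echo "$left" | tr '\n' ' ')" >&2
        return 1
    fi
    return 0
}

# Usage on a live node (not run here):
#   ps -e -o comm= | all_components_down || echo "pacemaker is not fully stopped"
```

A stop action could call something like this after pacemakerd exits and
return a non-zero status when anything is left, instead of reporting success.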

Best Regards,
Kazunori INOUE

>>> This isn't the case when the plugin is in use though, but then I'd
>>> also have expected most of the processes to die also.
>>>
>> If the processes died like that, the node status would change as well,
>> and that is the behavior we want.
>>
>>>>
>>>> ----
>>>> $ cat /etc/redhat-release
>>>> Red Hat Enterprise Linux Server release 6.3 (Santiago)
>>>>
>>>> $ ./configure --sysconfdir=/etc --localstatedir=/var --without-cman --without-heartbeat
>>>> -snip-
>>>> pacemaker configuration:
>>>>     Version                  = 1.1.8 (Build: 9c13d14)
>>>>     Features                 = generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc lha-fencing corosync-native snmp
>>>>
>>>>
>>>> $ cat config.log
>>>> -snip-
>>>> 6000 | #define BUILD_VERSION "9c13d14"
>>>> 6001 | /* end confdefs.h.  */
>>>> 6002 | #include <gio/gio.h>
>>>> 6003 |
>>>> 6004 | int
>>>> 6005 | main ()
>>>> 6006 | {
>>>> 6007 | if (sizeof (GDBusProxy))
>>>> 6008 |        return 0;
>>>> 6009 |   ;
>>>> 6010 |   return 0;
>>>> 6011 | }
>>>> 6012 configure:32411: result: no
>>>> 6013 configure:32417: WARNING: Unable to support systemd/upstart. You need to use glib >= 2.26
>>>> -snip-
>>>> 6286 | #define BUILD_VERSION "9c13d14"
>>>> 6287 | #define SUPPORT_UPSTART 0
>>>> 6288 | #define SUPPORT_SYSTEMD 0
>>>>
>>>>
>>>> Best Regards,
>>>> Kazunori INOUE
>>>>
>>>>
>>>>>
>>>>>> related bugzilla:
>>>>>> http://bugs.clusterlabs.org/show_bug.cgi?id=5064
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Kazunori INOUE
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>
>>>>>> Project Home: http://www.clusterlabs.org
>>>>>> Getting started:
>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>