[Pacemaker] Periodically appear non-existent nodes

Mon Apr 30 02:29:34 UTC 2012

On Thu, Apr 19, 2012 at 7:06 PM, Vladislav Bogdanov
<bubble at hoster-ok.com> wrote:
> 19.04.2012 11:24, Andreas Kurz wrote:
>> On 04/18/2012 11:46 PM, ruslan usifov wrote:
>>>
>>>
>>> 2012/4/18 Andreas Kurz <andreas at hastexo.com>
>>>
>>>     On 04/17/2012 09:31 PM, ruslan usifov wrote:
>>>     >
>>>     >
>>>     > 2012/4/17 Proskurin Kirill <k.proskurin at corp.mail.ru>
>>>     >
>>>     >     On 04/17/2012 03:46 PM, ruslan usifov wrote:
>>>     >
>>>     >         2012/4/17 Andreas Kurz <andreas at hastexo.com>
>>>     >
>>>     >
>>>     >            On 04/14/2012 11:14 PM, ruslan usifov wrote:
>>>     >             > Hello
>>>     >             >
>>>     >             > I removed 2 nodes from the cluster with the following sequence:
>>>     >             >
>>>     >             > crm_node --force -R <id of node1>
>>>     >             > crm_node --force -R <id of node2>
>>>     >             > cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
>>>     >             > cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
>>>     >             > cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
>>>     >             > cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'
>>>     >             >
>>>     >             >
>>>     >             > After this the nodes are deleted, but if, for example, I restart
>>>     >             > (reboot) one of the existing nodes in the working cluster, the
>>>     >             > deleted nodes appear again in OFFLINE state
>>>     >
>>>     >
>>>     >     I had this problem some time ago.
>>>     >     I "solved" it with something like this:
>>>     >
>>>     >     crm node delete NODENAME
>>>     >     crm_node --force --remove NODENAME
>>>     >     cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
>>>     >     cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'
>>>     >
>>>     >     --
>>>     >
>>>     >
>>>     > I do the same, but sometimes after a cluster reconfiguration (a node
>>>     > failed due to a power supply failure) the removed nodes appear again;
>>>     > this has happened 3-4 times
>>>
>>>     And the same behavior if you switch your cluster into maintenance-mode
>>>     (to avoid service downtime) and stop/start pacemaker and corosync
>>>     completely?
>>>
>>>
>>> We will have a maintenance window this Friday (20.04.2012), so after
>>> that I can report more info.
>>
>> Of course, that is the safest option ... though you won't have a service
>> downtime if you enable maintenance-mode prior to cluster restart.
>
> Unless you are using DLM (CLVM, GFS2, OCFS2). Then you should not stop
> corosync - dlm_controld uses CPG.
>
> And, DLM may use pacemaker parts for fencing (cib, attrd, stonith,
> depending on version).
>
>>
>>>
>>> PS: I had a similar situation on another cluster some time ago. There I
>>> fully restarted the cluster and the problem reproduced. But after some
>>> time (about 1-2 weeks) the non-existent nodes ceased to appear
>>
>> Now that is really strange ... if that happens again, the
>> corosync/pacemaker log files would be really interesting to have a look at.
>
> I recall this has been a known issue for a rather long time.
> One needs to do a full (not rolling) restart to make a node fully disappear.
> I checked this again not so long ago, and yes, node deletion does not
> work with the current master branch (or very close to it): the node appears
> again after a pacemaker restart on any other node.

Not really enough info to do anything about.

>
> Maybe it is because of the lrmd cache, like with failed actions? It looks
> very similar to that.

Nope. The cache is for the local node; if the node is gone, so is its cache.
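
For reference, the per-node removal sequence quoted above can be wrapped in a small shell helper. This is only a sketch: the function names remove_node and run and the DRY_RUN variable are made up for illustration, and the four commands are copied from the quoted messages rather than checked against any particular Pacemaker version. As noted in the thread, this sequence alone may not stop removed nodes from reappearing until a full (not rolling) cluster restart.

```shell
#!/bin/sh
# Sketch of the node removal sequence quoted in this thread.
# DRY_RUN (hypothetical variable) defaults to 1, so commands are only
# printed; set DRY_RUN=0 to actually execute them on a cluster node.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Print the command instead of running it (shell quoting is not shown).
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

remove_node() {
    node=$1
    if [ -z "$node" ]; then
        echo "usage: remove_node NODENAME" >&2
        return 1
    fi
    # Same four steps, in the same order, as in the quoted message.
    run crm node delete "$node"
    run crm_node --force --remove "$node"
    run cibadmin --delete --obj_type nodes --crm_xml "<node uname=\"$node\"/>"
    run cibadmin --delete --obj_type status --crm_xml "<node_state uname=\"$node\"/>"
}
```

Calling remove_node node1 with the default DRY_RUN=1 prints the four commands for review; DRY_RUN=0 remove_node node1 would run them.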