[Pacemaker] Periodically appear non-existent nodes

Thu Apr 19 11:37:54 CEST 2012

On 04/19/2012 11:06 AM, Vladislav Bogdanov wrote:
> 19.04.2012 11:24, Andreas Kurz wrote:
>> On 04/18/2012 11:46 PM, ruslan usifov wrote:
>>>
>>>
>>> 2012/4/18 Andreas Kurz <andreas at hastexo.com>
>>>
>>>     On 04/17/2012 09:31 PM, ruslan usifov wrote:
>>>     >
>>>     >
>>>     > 2012/4/17 Proskurin Kirill <k.proskurin at corp.mail.ru>
>>>     >
>>>     >     On 04/17/2012 03:46 PM, ruslan usifov wrote:
>>>     >
>>>     >         2012/4/17 Andreas Kurz <andreas at hastexo.com>
>>>     >
>>>     >
>>>     >            On 04/14/2012 11:14 PM, ruslan usifov wrote:
>>>     >             > Hello
>>>     >             >
>>>     >             > I removed 2 nodes from the cluster with the following sequence:
>>>     >             >
>>>     >             > crm_node --force -R <id of node1>
>>>     >             > crm_node --force -R <id of node2>
>>>     >             > cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
>>>     >             > cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
>>>     >             > cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
>>>     >             > cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'
>>>     >             >
>>>     >             >
>>>     >             > The nodes are deleted after this, but if, for example, I restart
>>>     >             > (reboot) one of the existing nodes in the working cluster, the
>>>     >             > deleted nodes appear again in OFFLINE state
>>>     >
>>>     >
>>>     >     I had this problem some time ago.
>>>     >     I "solved" it with something like this:
>>>     >
>>>     >     crm node delete NODENAME
>>>     >     crm_node --force --remove NODENAME
>>>     >     cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
>>>     >     cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'
>>>     >
>>>     >     --
>>>     >
>>>     >
>>>     > I did the same, but sometimes after a cluster reconfiguration (a node
>>>     > failed due to a power supply failure) the removed nodes appear again;
>>>     > this has happened 3-4 times
>>>
>>>     And the same behavior if you switch your cluster into maintenance-mode
>>>     (to avoid service downtime) and stop/start pacemaker and corosync
>>>     completely?
>>>
>>>
>>> We will have a maintenance window this Friday (20.04.2012), so after
>>> that I can report more info.
>>
>> Of course, that is the safest option ... though you won't have a service
>> downtime if you enable maintenance-mode prior to cluster restart.
> 
> Unless you are using DLM (CLVM, GFS2, OCFS2). Then you should not stop
> corosync - dlm_controld uses CPG.
> 
> And, DLM may use pacemaker parts for fencing (cib, attrd, stonith,
> depending on version).

Yes, of course ... that won't work if you are using dlm. Thanks for
pointing that out explicitly, Vladislav ... it is good to have it here in
the ML archive for the record ;-)

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
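
The maintenance-mode restart and the DLM caveat discussed above could be
sketched roughly as follows. This is only an illustration, not a tested
procedure: dlm_in_use and plan_restart are hypothetical helper names, and
treating a running dlm_controld process as "DLM is in use" is an
assumption. The script only prints the plan; it does not stop or start
anything.

```shell
#!/bin/sh
# Sketch: decide whether a full stack restart under maintenance-mode is an
# option. dlm_in_use and plan_restart are hypothetical names for illustration.

dlm_in_use() {
    # Assumption: a running dlm_controld (or dlm_controld.pcmk) process
    # means DLM is in use (CLVM, GFS2, OCFS2).
    pgrep dlm_controld >/dev/null 2>&1
}

plan_restart() {
    if dlm_in_use; then
        # dlm_controld uses CPG, so corosync must not be stopped under it.
        echo "dlm_controld is running: do not stop corosync under live services"
        return 1
    fi
    # Only print the plan; how the services are stopped and started
    # depends on the distribution and is left to the operator.
    echo "crm configure property maintenance-mode=true"
    echo "# stop pacemaker, then corosync, on all nodes; then start corosync, then pacemaker"
    echo "crm configure property maintenance-mode=false"
}
```

Calling plan_restart on one node prints either the warning or the three
planned steps. Whether a full restart actually clears the stale nodes is
exactly what is still open in this thread.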

> 
>>
>>>
>>> PS: I had a similar situation on another cluster some time ago; there I
>>> fully restarted the cluster and the problem still reproduced. But after
>>> some time (about 1-2 weeks) the non-existent nodes ceased to appear
>>
>> Now that is really strange ... if that happens again, the
>> corosync/pacemaker log files would be really interesting to have a look at.
> 
> I recall this has been a known issue for a rather long time.
> One needs to do a full (not rolling) restart to make a node fully disappear.
> I checked this again not so long ago, and yes, node deletion does not
> work with the current master branch (or very close to it): the node appears
> again after a pacemaker restart on any other node.
> 
> Maybe it is because of the lrmd cache, like with failed actions? It looks
> very similar to that.
> 
> Andrew, David?
> 
> Best,
> Vladislav
> 

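For reference, the removal sequence quoted in this thread could be wrapped
in a small script like the one below. This is a minimal sketch that follows
the variant posted by Proskurin Kirill; remove_node, run and DRY_RUN are
hypothetical names introduced here. It defaults to printing the commands
instead of running them, and, as the thread notes, the removed node may
still reappear after another node restarts.

```shell
#!/bin/sh
# Sketch: print (or run) the node-removal sequence quoted in this thread.
# remove_node, run and DRY_RUN are illustrative names, not Pacemaker tooling.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Dry run: show the command; shell quoting is not reproduced here.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

remove_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: remove_node NODENAME" >&2
        return 1
    fi
    # Same four steps as in the quoted message, for a single node name.
    run crm node delete "$node"
    run crm_node --force --remove "$node"
    run cibadmin --delete --obj_type nodes --crm_xml "<node uname=\"$node\"/>"
    run cibadmin --delete --obj_type status --crm_xml "<node_state uname=\"$node\"/>"
}
```

With the default DRY_RUN=1, "remove_node node1" prints the four commands
for review; setting DRY_RUN=0 before calling it would execute them.
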