[Pacemaker] Enable remote monitoring

Tue Dec 11 22:14:24 EST 2012

On 12/12/12 01:53, David Vossel wrote:
> ----- Original Message -----
>> From: "Yan Gao" <ygao at suse.com>
>> To: pacemaker at oss.clusterlabs.org
>> Sent: Tuesday, December 11, 2012 1:23:03 AM
>> Subject: Re: [Pacemaker] Enable remote monitoring
>>
>> Hi,
>> Here's the latest code:
>> https://github.com/gao-yan/pacemaker/commit/4d58026c2171c42385c85162a0656c44b37fa7e8
>>
>>
>> Now:
>> - container-type:
>>   * black - ordering, colocating
>>   * white - ordering
>>   Neither of them is probed so far.
> 
> I think for the sake of this implementation we should ignore the whitebox use case for now. There are aspects of the whitebox use case that I'm just not sure about yet, and I don't want to hold you all up trying to define them. I don't mind revisiting this container concept later and expanding it to the whitebox use case, building on what you have here. I'm in favor of removing "container-type" and letting the blackbox use case be the default for now; I'll go in and do our whitebox bits later.
Hmm, this might be better until we have a clear definition for whitebox.


> It feels like we are at least headed in the right direction with all of this now.
Same feeling here :-)

> 
>>
>> - on-fail defaults to "restart-container" for most actions,
>>   except for the stop op (Not sure what it means if a stop fails. A
>>   nagios daemon cannot be terminated? Should it always return
>>   success?),
> 
> A nagios "stop" action should always return success. The nagios agent doesn't even need a stop function; the lrmd can know to treat a "stop" as (a no-op for stop) + (cancel all recurring actions). In this case, if the nagios agent doesn't stop successfully, it is because of an lrmd failure, which should result in a fencing action, I'd imagine.
Makes sense.
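
The stop handling described above can be sketched roughly as follows. This is illustrative Python, not lrmd code: the RecurringOps class, the stop_nagios_resource function, and the resource names are all hypothetical. The only thing taken from elsewhere is that 0 is the OCF return code for success.

```python
# Hypothetical sketch only; none of these names exist in the lrmd.
OCF_OK = 0  # OCF return code for success


class RecurringOps:
    """Toy registry of recurring operations, keyed by resource."""

    def __init__(self):
        # Set of (rsc_id, op_name, interval_ms) tuples.
        self._ops = set()

    def schedule(self, rsc_id, op_name, interval_ms):
        self._ops.add((rsc_id, op_name, interval_ms))

    def cancel_all(self, rsc_id):
        """Cancel every recurring op of one resource; return how many."""
        doomed = {op for op in self._ops if op[0] == rsc_id}
        self._ops -= doomed
        return len(doomed)

    def active(self, rsc_id):
        return sorted(op for op in self._ops if op[0] == rsc_id)


def stop_nagios_resource(ops, rsc_id):
    """Treat "stop" as (no-op for stop) + (cancel all recurring actions).

    The agent itself is never called, so the stop cannot fail because
    of the agent; it always reports success.
    """
    ops.cancel_all(rsc_id)
    return OCF_OK


ops = RecurringOps()
ops.schedule("vm1-web", "monitor", 10000)
ops.schedule("vm1-web", "monitor", 60000)
ops.schedule("vm1-db", "monitor", 10000)

rc = stop_nagios_resource(ops, "vm1-web")
```

With this shape, the only way a stop could go wrong is a failure in the daemon itself, which matches the point about fencing.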

> 
>>   still defaults to "fence" for it for now.
>>
>> - Failures of resources count against container's
>> migration-threshold.
> 
> What happens if someone wants to clear the container's failcount? Do we need to add some logic to go in and clear all the child resources' failures as well to make this happen correctly?
> 
I'm not quite sure about that either. Since the failcounts of the
container and its children are still shown separately in the UIs, I
think we should still allow users to clear any of them individually.
We could probably add an option for crm_resource/crm_failcount to clean
up a resource as a "container", so that all failed operations and/or
failcounts within it are cleaned up as well. Otherwise, we can of
course leave this to the shells or GUIs.
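
A minimal sketch of the two cleanup modes, in Python for brevity. Everything here is an assumption for illustration: the clear_failcounts function, the as_container flag, and the data layout are hypothetical, and no such option exists in crm_resource or crm_failcount today.

```python
# Hypothetical sketch only; not an existing crm_resource/crm_failcount option.
def clear_failcounts(failcounts, children_of, rsc_id, as_container=False):
    """Clear the failcount of rsc_id, optionally with its children.

    failcounts:  dict mapping resource id -> failure count
    children_of: dict mapping container id -> list of resources inside it
    Returns the ids whose non-zero failcount was actually cleared.
    """
    targets = [rsc_id]
    if as_container:
        # Cleaning up "as a container" also covers everything inside it.
        targets += children_of.get(rsc_id, [])

    cleared = []
    for rsc in targets:
        if failcounts.pop(rsc, 0) > 0:
            cleared.append(rsc)
    return cleared


failcounts = {"vm1": 1, "vm1-web": 3, "vm1-db": 0, "vm2-web": 2}
children_of = {"vm1": ["vm1-web", "vm1-db"]}

# Clearing the container alone leaves its children untouched.
first = clear_failcounts(failcounts, children_of, "vm1")

# Clearing it "as a container" sweeps up the children as well.
second = clear_failcounts(failcounts, children_of, "vm1", as_container=True)
```

Keeping the default as "clear only what was named" preserves the per-resource behaviour that users see in the UIs, and the container-wide sweep stays opt-in.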

Regards,
  Gao,Yan
-- 
Gao,Yan <ygao at suse.com>
Software Engineer
China Server Team, SUSE.