[Pacemaker] Enable remote monitoring

Tue Dec 11 21:29:21 EST 2012

On Wed, Dec 12, 2012 at 4:53 AM, David Vossel <dvossel at redhat.com> wrote:
> ----- Original Message -----
>> From: "Yan Gao" <ygao at suse.com>
>> To: pacemaker at oss.clusterlabs.org
>> Sent: Tuesday, December 11, 2012 1:23:03 AM
>> Subject: Re: [Pacemaker] Enable remote monitoring
>>
>> Hi,
>> Here's the latest code:
>> https://github.com/gao-yan/pacemaker/commit/4d58026c2171c42385c85162a0656c44b37fa7e8
>>
>>
>> Now:
>> - container-type:
>>   * black - ordering, colocating
>>   * white - ordering
>>   Neither of them is probed so far.
>
> I think for the sake of this implementation we should ignore the whitebox use case for now.  There are aspects of the whitebox use case that I'm just not sure about yet, and I don't want to hold you all up trying to define that. I don't mind re-approaching this container concept and expanding it to the whitebox use case later on, building on what you have here.  I'm in favor of removing "container-type" and letting the blackbox use case be the default for now, and I'll go in and do our whitebox bits later.  It feels like we are at least headed in the right direction with all of this now.
>
>>
>> - on-fail defaults to "restart-container" for most actions,
>>   except for the stop op (Not sure what it means if a stop fails. A
>>   nagios daemon cannot be terminated? Should it always return success?),
>
> A nagios "stop" action should always return success.  The nagios agent doesn't even need a stop function; the lrmd can know to treat a "stop" as a (no-op for stop) + (cancel all recurring actions).

The lrmd shouldn't need to do this iirc.
The crmd will request all recurring ops be canceled before firing off
the stop action.
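
To illustrate the ordering described above, here is a rough Python sketch, not actual crmd or lrmd code. The class and function names (RecurringOp, MonitorOnlyResource, request_stop) are hypothetical and exist only for this example; it assumes a nagios-style resource whose stop has nothing to terminate.

```python
from dataclasses import dataclass, field
from typing import List

OCF_SUCCESS = 0  # exit code conventionally meaning "success"


@dataclass
class RecurringOp:
    """Hypothetical stand-in for a recurring operation such as a monitor."""
    name: str
    interval_s: int
    cancelled: bool = False


@dataclass
class MonitorOnlyResource:
    """Hypothetical model of a nagios-style resource with no real stop."""
    rsc_id: str
    recurring: List[RecurringOp] = field(default_factory=list)

    def stop(self) -> int:
        # Nothing to terminate: the check is only a periodic probe,
        # so stop is a no-op that always reports success.
        return OCF_SUCCESS


def request_stop(rsc: MonitorOnlyResource) -> int:
    """Toy controller-side sequence: cancel recurring ops, then stop."""
    for op in rsc.recurring:
        op.cancelled = True
    return rsc.stop()
```

With this ordering the executor never has to special-case stop: by the time stop runs, nothing recurring is left to cancel.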

> In this case, if the nagios agent doesn't stop successfully, it is because of an lrmd failure, which should result in a fencing action, I'd imagine.
>
>> still defaults to "fence" for it for now.
>>
>> - Failures of resources count against the container's migration-threshold.
>
> What happens if someone wants to clear the container's failcount? Do we need to add some logic to go in and clear all the child resources' failures as well to make this happen correctly?
>
> -- Vossel
>
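
To make the failcount question concrete, here is a small Python sketch of one possible bookkeeping model. It is only an assumption about how this could behave, not what the linked commit does: failures are stored per resource, a container's effective count is its own failures plus those of its children, and clearing the container also clears the children. The FailCounts class and all of its method names are hypothetical.

```python
class FailCounts:
    """Hypothetical failure bookkeeping for containers and their resources."""

    def __init__(self):
        self.counts = {}        # resource id -> number of recorded failures
        self.container_of = {}  # child resource id -> container id

    def add_child(self, child, container):
        self.container_of[child] = container

    def record_failure(self, rsc):
        self.counts[rsc] = self.counts.get(rsc, 0) + 1

    def effective(self, container):
        # The container's own failures plus those of every child inside it.
        own = self.counts.get(container, 0)
        children = sum(
            n for rsc, n in self.counts.items()
            if self.container_of.get(rsc) == container
        )
        return own + children

    def exceeds(self, container, migration_threshold):
        # Assumed semantics: the threshold is reached once the effective
        # failure count is at least migration-threshold.
        return self.effective(container) >= migration_threshold

    def clear(self, container):
        # Clearing the container cascades to its children, so stale child
        # failures cannot push it straight back over the threshold.
        self.counts.pop(container, None)
        for child, parent in self.container_of.items():
            if parent == container:
                self.counts.pop(child, None)
```

Without the cascade in clear(), a cleanup of the container alone would leave the effective count unchanged whenever the failures came from its child resources.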
>> - Also supports grouping a container with its resources.
>>
>> Please help take a look, and correct me if I missed anything after the
>> tons of discussions. :-)
>>
>> Regards,
>>   Gao,Yan
>> --
>> Gao,Yan <ygao at suse.com>
>> Software Engineer
>> China Server Team, SUSE.
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>