[Pacemaker] Enable remote monitoring

Fri Dec 7 04:17:03 EST 2012

On Fri, Dec 7, 2012 at 3:19 PM, Gao,Yan <ygao at suse.com> wrote:
> On 12/07/12 12:09, Andrew Beekhof wrote:
>> On Fri, Dec 7, 2012 at 3:00 PM, Gao,Yan <ygao at suse.com> wrote:
>>> On 12/07/12 07:38, Andrew Beekhof wrote:
>>>>
>>>> On 06/12/2012, at 10:42 PM, Lars Marowsky-Bree <lmb at suse.com> wrote:
>>>>
>>>>> On 2012-12-06T22:25:40, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>
>>>>>> But any failures of the nagios agents would count against the VM's
>>>>>> migration-threshold.
>>>>>> So if moving were the right thing to do, it would have done it already.
>>>>>
>>>>> OK. I think this was due to me still being stuck on the workings of an
>>>>> order constraint, but of course if the failures are instead attributed
>>>>> to the container, this would happen automatically already. True.
>>>>>
>>>>> (Incidentally, I like "attribute", "ascribe" better than "delegate"
>>>>> because to me, they better fit what's going on, if we stuck with
>>>>> "delegate-failures". Just saying. ;-)
>>>>
>>>> My use of "delegate" comes from my time with Objective-C, where it's common practice to use delegates for "I'm not going to handle X, but here is something that does" style functionality.
>>>> Which fits nicely with what we're doing here.
>>>>
>>>> container="vm"  also works though.
>>>>
>>>>>
>>>>>>> We already have on-fail settings. How would these play together?
>>>>>> Good question. My initial thought was that it would be up to on-fail
>>>>>> settings in the VM.
>>>>>
>>>>> I'd prefer to keep that separate (as proposed below). Because if an
>>>>> action of the *VM* really fails, I may want an admin to look into it
>>>>> (why could the bloody hypervisor not start/stop it?), which is different
>>>>> from restarting the VM if one of the resources within it needs that.
>>>>>
>>>>>>> Would it even make sense to have on-fail="restart-container"? (Or a
>>>>>>> nicer wording.)
>>>>>>>
>>>>>>> Hmmm. That might work. We allow a "container" to be specified as a meta
>>>>>>> attribute.
>>>>>>>
>>>>>>> If set, on-fail would default to restart container for most actions. But
>>>>>>> admins could actually modify it - say, they might want to set
>>>>>>> monitor on-fail="ignore" to just get notified. And when we move forward
>>>>>>> to whiteboxes, we could have start/monitor/promote/demote
>>>>>>> on-fail="restart" (like now) and stop on-fail="restart-container".
>>>>>>>
>>>>>>> That appears reasonably neat?
>>>>>> It does actually.
>>>>>> I wasn't originally thinking it was necessary but it makes sense now
>>>>>> that you point it out.
>>>>>
>>>>> Yes, I think I like this too now.
>>>>>
>>>>> Uhm. Would "container" imply ordering + colocation, or would we still
>>>>> need them grouped (resource_set'ed, whatever)?
>>>>
>>>> Ordering: absolutely
>>> Would any user dislike the implied ordering and instead want an
>>> asymmetrical or otherwise unusual one?
>>
>> Conceptually it doesn't make any sense IMHO.
>> By definition things can't be in/on the container if the container
>> doesn't exist yet.
> Right.
>
>>
>> The one thing we've not addressed yet is probing, that's going to be fun :)
> I guess there should be some way for the nagios RAs to return
> NOT_RUNNING if there's nothing yet, no?

Right, but it's talking to an IP address.
Once the guest is up it can be seen from all the nodes, so a reprobe
would make it appear to be active _everywhere_.
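
To make the probing problem concrete, here is a rough sketch in Python. It is
only a toy model, not Pacemaker code: GuestService, probe() and reprobe() are
hypothetical names, and "RUNNING" is just an illustrative label for a
successful check. The point is that the check depends only on the guest's IP
address, never on the node that runs it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class GuestService:
    """Hypothetical model of a service checked over the network."""
    address: str    # IP address the nagios-style check talks to
    guest_up: bool  # whether the guest (VM) is currently running


def probe(node: str, service: GuestService) -> str:
    """Probe the service from the given node.

    The check goes to service.address, so the 'node' argument has no
    influence on the result. That is exactly the problem.
    """
    return "RUNNING" if service.guest_up else "NOT_RUNNING"


def reprobe(nodes: Iterable[str], service: GuestService) -> Dict[str, str]:
    """Probe the service from every node and collect the results."""
    return {node: probe(node, service) for node in nodes}


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    # Guest is up: every node reports the service as active.
    print(reprobe(nodes, GuestService("192.0.2.10", guest_up=True)))
    # Guest is down: every node reports NOT_RUNNING.
    print(reprobe(nodes, GuestService("192.0.2.10", guest_up=False)))
```

With the guest up, every node gets the same answer, so a naive reprobe cannot
tell which node the resource actually belongs to.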