[Pacemaker] Failure after intermittent network outage

Fri Mar 11 05:43:49 EST 2011

On Thu, Mar 10, 2011 at 1:03 PM, Pavel Levshin <pavel at levshin.spb.ru> wrote:
> Hi,
>
> No, I think you've missed the point. RA did not answer at all. Monitor
> actions had been lost due to a cluster transition:

You are incorrect.
While it is true that some actions were NACKed (not lost), such NACKs
do not make it into the CIB and therefore cannot be the cause of logs
such as:

Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log:
unknown error (1)

>
> So the RA did not have a chance to answer anything.

Incorrect.

> Apart from this, should I fake all RAs that are supposed to be unused on
> particular nodes in the cluster? It seems to me like only a partial
> solution.

Either remove the RA, or make sure it returns something sensible when
tools or configuration it needs are not available.
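
As a rough illustration of what "something sensible" could look like, here is
a minimal sketch of the decision a monitor/probe action might make. The
numeric exit codes are the standard OCF ones; the function name, its
parameters, and the choice to report a missing configuration file as "not
installed" are assumptions made for this example, not code from any shipped
agent.

```python
import os
import shutil

# Standard OCF exit codes used in this sketch.
OCF_SUCCESS = 0        # resource is running
OCF_ERR_GENERIC = 1    # "unknown error": what the cluster saw in the logs above
OCF_ERR_INSTALLED = 5  # "not installed": needed tools/config are absent
OCF_NOT_RUNNING = 7    # resource is cleanly stopped


def probe_exit_code(required_tool, config_path, is_running,
                    which=shutil.which, exists=os.path.exists):
    """Hypothetical helper: pick an exit code for a monitor/probe action.

    `which` and `exists` are injectable only so the logic can be exercised
    without touching the real filesystem.
    """
    # If the tool the agent depends on is absent, say so explicitly
    # instead of falling through to a generic error.
    if which(required_tool) is None:
        return OCF_ERR_INSTALLED
    # Assumption for this sketch: a missing config is reported the same way.
    if not exists(config_path):
        return OCF_ERR_INSTALLED
    return OCF_SUCCESS if is_running() else OCF_NOT_RUNNING
```

The point is only that the "cannot possibly run here" case gets its own
answer, so the cluster is not left guessing from a generic error.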

>
> Suppose that I want to run virtual machine "X" on hardware nodes A and B,
> and VM "Y" on nodes B and C. With DRBD, this is a very common configuration,
> because "X" cannot access its disk device on hardware node C. Currently,
> I must configure "X" and "Y" on every hardware node, or the RA will fail with
> status "not configured". That is not a minimal configuration, so it is more
> error-prone than necessary.
>
> I would be happy to tell the cluster never to touch resource "X" on node C
> in this case. What do you think?

No.  For safety we still need to verify that X is not running on node
C before we allow it to be active anywhere else.
That you know X is unavailable on C is one thing, but the cluster
needs to know too.
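
To make the reasoning concrete, here is a deliberately simplified sketch of
that safety rule. It is not the policy engine's actual code; the function
name, the state strings, and the dictionary layout are all invented for this
illustration.

```python
# Probe results that let the cluster conclude "not active on this node".
INACTIVE_STATES = {"stopped", "not_installed"}


def may_start(probe_results, all_nodes):
    """Hypothetical check: is it safe to start the resource somewhere?

    probe_results maps node name -> reported state. A node with no entry
    has not answered its probe yet, so its state is unknown.
    """
    # Every node must have positively reported the resource as inactive.
    # An unknown or failed probe blocks the start, because the resource
    # might still be running there.
    return all(probe_results.get(node) in INACTIVE_STATES
               for node in all_nodes)


nodes = ["A", "B", "C"]
# C has answered "not installed": the cluster knows X is not active there.
print(may_start({"A": "stopped", "B": "stopped", "C": "not_installed"}, nodes))
# C has not answered (or answered with a generic error): start is blocked.
print(may_start({"A": "stopped", "B": "stopped"}, nodes))
```

In this model a "not installed" answer from C is enough to unblock X
elsewhere, while silence or an unknown error from C is not.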

>
>
> 10.03.2011 14:09, Andrew Beekhof wrote:
>>
>> Your basic problem is this...
>>
>> Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
>> Processing failed op vm-mproxy1-1_monitor_0 on wapgw1-log: unknown
>> error (1)
>>
>> We asked what state the resource was in and it replied "arrrggghhhh"
>> instead of "not installed".
>> Had it replied with "not installed", we'd have no reason to call stop or
>> fence the node to try and clean it up.
>
>
> --
> Pavel Levshin //flicker