[Pacemaker] Failure after intermittent network outage

Thu Mar 10 12:03:28 UTC 2011

Hi,

No, I think you've missed the point. The RA did not answer at all. The 
monitor actions were lost because of a cluster transition:

Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op: 
Performing key=33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd 
op=p-drbd-mdirect1-1:0_monitor_0 )
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: do_lrm_rsc_op: 
Discarding attempt to perform action monitor on p-drbd-mdirect1-1:0 in 
state S_ELECTION
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: send_direct_ack: ACK'ing 
resource op p-drbd-mdirect1-1:0_monitor_0 from 
33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd: 
lrm_invoke-lrmd-1298967360-58
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: process_te_message: 
Processing (N)ACK lrm_invoke-lrmd-1298967360-58 from wapgw1-log
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: process_graph_event: 
Action p-drbd-mdirect1-1:0_monitor_0/33 
(4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) initiated by a 
different transitioner
Mar  1 11:16:00 wapgw1-log crmd: [24547]: info: abort_transition_graph: 
process_graph_event:456 - Triggered transition abort (complete=1, 
tag=lrm_rsc_op, id=p-drbd-mdirect1-1:0_monitor_0, 
magic=4:99;33:1353:7:22dc5497-478f-49ff-b07f-9fcd6da325cd) : Foreign event

So the RA never had a chance to answer anything.

Apart from this, should I fake all RAs that are not supposed to be used 
on particular nodes in the cluster? It seems to me like only a partial 
solution.

Suppose that I want to run virtual machine "X" on hardware nodes A and 
B, and VM "Y" on nodes B and C. With DRBD this is a very common 
configuration, because "X" cannot access its disk device on hardware 
node C. Currently, I must configure both "X" and "Y" on every hardware 
node, or the RA will fail with status "not configured". This is not a 
minimal configuration, so it is more error-prone than necessary.
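
To make the probe case concrete, here is a rough sketch of the behaviour 
Andrew describes in the quoted text below: a probe (monitor_0) on a node 
where the VM is not set up answers "not installed" instead of a generic 
error. It is written in Python purely for illustration (a real RA is 
normally a shell script). The function name probe and its two parameters 
are hypothetical; only the numeric codes are the standard OCF exit codes.

```python
import os

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0        # resource is running
OCF_ERR_GENERIC = 1    # "unknown error", the code seen in the pengine log
OCF_ERR_INSTALLED = 5  # "not installed"
OCF_NOT_RUNNING = 7    # resource is cleanly stopped


def probe(config_path, is_running):
    """Return an OCF exit code for a probe (hypothetical helper).

    config_path: path to the VM definition on this node (assumption).
    is_running:  callable returning True if the VM is active here.
    """
    if not os.path.exists(config_path):
        # The VM is not set up on this node at all, so report
        # "not installed" rather than a generic failure.
        return OCF_ERR_INSTALLED
    return OCF_SUCCESS if is_running() else OCF_NOT_RUNNING
```

This still requires every RA to cope with nodes it will never run on, 
which is exactly the part I would like to avoid.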

I would be happy to tell the cluster never to touch resource "X" on node 
C in this case. What do you think?

10.03.2011 14:09, Andrew Beekhof wrote:
> Your basic problem is this...
>
> Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
> Processing failed op vm-mproxy1-1_monitor_0 on wapgw1-log: unknown
> error (1)
>
> We asked what state the resource was in and it replied "arrrggghhhh"
> instead of "not installed".
> Had it replied with not installed, we'd have no reason to call stop or
> fence the node to try and clean it up.

--
Pavel Levshin //flicker