[Pacemaker] Pacemaker very often STONITHs the other node
Michał Margula
alchemyx at uznam.net.pl
Mon Nov 25 15:39:42 UTC 2013
On 25.11.2013 at 15:44, Digimer wrote:
> My first thought is that the network is congested. That is a lot of
> servers to have on the system. Do you or can you isolate the corosync
> traffic from the drbd traffic?
>
> Personally, I always set up a dedicated network for corosync, another for
> drbd and a third for all traffic to/from the servers. With this, I have
> never had a congestion-based problem.
>
> If possible, please paste all logs from both nodes, starting just before
> the STONITH occurred and continuing until recovery completed.
>
Hello,
DRBD and CRM traffic go over a dedicated link (two gigabit links bonded
into one). It is never saturated or congested; it barely reaches 300 Mbps
at its highest points. I have a separate link for traffic to/from the
virtual machines and also a separate link for managing the nodes (just
SSH and SNMP). I can isolate corosync onto a separate link, but it will
take some time to do.
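If I do move corosync onto its own network, I assume it comes down to
adding a second ring in corosync.conf, roughly like this (the subnets and
multicast addresses below are only placeholders, not my real ones, and I
have not tested it yet):

  totem {
      version: 2
      # run both rings, fall back to the surviving one if a link dies
      rrp_mode: passive
      interface {
          ringnumber: 0
          bindnetaddr: 10.0.0.0      # placeholder: current bonded cluster/DRBD network
          mcastaddr: 239.255.1.1
          mcastport: 5405
      }
      interface {
          ringnumber: 1
          bindnetaddr: 10.0.1.0      # placeholder: dedicated corosync-only network
          mcastaddr: 239.255.2.1
          mcastport: 5407
      }
  }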
Now the logs...
The trouble started on November 23 at 15:14.
Here is a log from "A" node: http://pastebin.com/yM1fqvQ6
Node B: http://pastebin.com/nwbctcgg
Node B is the one that got hit by STONITH; it was killed at 15:18:50. I
have trouble understanding the reasons for that.
Is the reason for the STONITH that these operations took a long time to
finish?
Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[114] on XEN-piaskownica for client 9529 stayed in
operation list for 24760 ms (longer than 10000 ms)
Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[115] on XEN-acsystemy01 for client 9529 stayed in
operation list for 25760 ms (longer than 10000 ms)
Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[116] on XEN-frodo for client 9529 stayed in operation
list for 50760 ms (longer than 10000 ms)
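If the Xen domains simply need that long to shut down cleanly, I suppose
I could raise the stop timeouts on those resources, something like this
(the xmfile path and the timeout values here are made up for
illustration, not taken from my config):

  # crm configure edit XEN-piaskownica, then e.g.:
  primitive XEN-piaskownica ocf:heartbeat:Xen \
          params xmfile="/etc/xen/piaskownica.cfg" \
          op start interval="0" timeout="60s" \
          op stop interval="0" timeout="120s" \
          op monitor interval="60s" timeout="60s"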
But I wonder what made it stop those virtual machines in the first place?
Another clue is here:
Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice:
reduce operation contention either by increasing lrmd max_children or by
increasing intervals of monitor operations
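If that advice is the way to go, I could try something along these lines
(the value 8 is only a guess, and if I understand lrmadmin correctly the
change applies to the running lrmd only, it is not persistent):

  # let lrmd run more operations in parallel
  lrmadmin -p max-children 8

  # and/or spread the monitors out, i.e. raise "op monitor interval"
  # on the XEN-* primitives via crm configure edit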
And here:
coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN:
unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on
rivendell-B: not running (7)
But why "not running"? That is not really true. There is also some
trouble with fencing:
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on
rivendell-A: unknown error (1)
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
common_apply_stickiness: Forcing fencing-of-B away from rivendell-A
after 1000000 failures (max=1000000)
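I assume that once the underlying problem is sorted out I can check and
clear that failcount from the crm shell, something like:

  crm resource failcount fencing-of-B show rivendell-A
  crm resource cleanup fencing-of-B rivendell-A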
Thank you!
--
Michał Margula, alchemyx at uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]