[Pacemaker] Pacemaker very often STONITHs the other node
Michał Margula
alchemyx at uznam.net.pl
Mon Nov 25 15:39:42 UTC 2013
On 25.11.2013 at 15:44, Digimer wrote:
> My first thought is that the network is congested. That is a lot of
> servers to have on the system. Do you or can you isolate the corosync
> traffic from the drbd traffic?
>
> Personally, I always set up a dedicated network for corosync, another for
> drbd and a third for all traffic to/from the servers. With this, I have
> never had a congestion-based problem.
>
> If possible, please paste all logs from both nodes, starting just before
> the STONITH occurred and continuing until recovery completed.
>
Hello,
DRBD and CRM traffic go over a dedicated link (two gigabit links bonded
into one). It is never saturated or congested; it barely reaches 300 Mbps
at its highest points. I have a separate link for traffic to/from the
virtual machines and also a separate link for managing the nodes (just
SSH and SNMP). I can isolate corosync onto a separate link, but it will
take some time to do.
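If I do move corosync onto its own network, I assume it comes down to
adding a second ring in corosync.conf, roughly like this (the subnets and
multicast addresses below are only placeholders, not my real ones, and I
have not tested it yet):

  totem {
      version: 2
      # run both rings, fall back to the surviving one if a link dies
      rrp_mode: passive
      interface {
          ringnumber: 0
          bindnetaddr: 10.0.0.0      # placeholder: current bonded cluster/DRBD network
          mcastaddr: 239.255.1.1
          mcastport: 5405
      }
      interface {
          ringnumber: 1
          bindnetaddr: 10.0.1.0      # placeholder: dedicated corosync-only network
          mcastaddr: 239.255.2.1
          mcastport: 5407
      }
  }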
Now the logs...
The trouble started on November 23 at 15:14.
Here is a log from "A" node: http://pastebin.com/yM1fqvQ6
Node B: http://pastebin.com/nwbctcgg
Node B is the one that got hit by STONITH; it was killed at 15:18:50. I
have trouble understanding the reasons for that.
Is the reason for the STONITH that these operations took a long time to
finish?
Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[114] on XEN-piaskownica for client 9529 stayed in
operation list for 24760 ms (longer than 10000 ms)
Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[115] on XEN-acsystemy01 for client 9529 stayed in
operation list for 25760 ms (longer than 10000 ms)
Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[116] on XEN-frodo for client 9529 stayed in operation
list for 50760 ms (longer than 10000 ms)
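If the Xen domains simply need that long to shut down cleanly, I suppose
I could raise the stop timeouts on those resources, something like this
(the xmfile path and the timeout values here are made up for
illustration, not taken from my config):

  # crm configure edit XEN-piaskownica, then e.g.:
  primitive XEN-piaskownica ocf:heartbeat:Xen \
          params xmfile="/etc/xen/piaskownica.cfg" \
          op start interval="0" timeout="60s" \
          op stop interval="0" timeout="120s" \
          op monitor interval="60s" timeout="60s"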
But I wonder what made it stop those virtual machines in the first place?
Another clue is here:
Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice:
reduce operation contention either by increasing lrmd max_children or by
increasing intervals of monitor operations
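If that advice is the way to go, I could try something along these lines
(the value 8 is only a guess, and if I understand lrmadmin correctly the
change applies to the running lrmd only, it is not persistent):

  # let lrmd run more operations in parallel
  lrmadmin -p max-children 8

  # and/or spread the monitors out, i.e. raise "op monitor interval"
  # on the XEN-* primitives via crm configure edit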
And here:
coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN:
unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on
rivendell-B: not running (7)
But why "not running"? That is not really true. There is also some
trouble with fencing:
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on
rivendell-A: unknown error (1)
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
common_apply_stickiness: Forcing fencing-of-B away from rivendell-A
after 1000000 failures (max=1000000)
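I assume that once the underlying problem is sorted out I can check and
clear that failcount from the crm shell, something like:

  crm resource failcount fencing-of-B show rivendell-A
  crm resource cleanup fencing-of-B rivendell-A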
Thank you!
--
Michał Margula, alchemyx at uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]