[Pacemaker] Pacemaker very often STONITHs other node
Digimer
lists at alteeve.ca
Mon Nov 25 17:25:23 UTC 2013
On 25/11/13 10:39, Michał Margula wrote:
> On 25.11.2013 15:44, Digimer wrote:
>> My first thought is that the network is congested. That is a lot of
>> servers to have on the system. Do you or can you isolate the corosync
>> traffic from the drbd traffic?
>>
>> Personally, I always set up a dedicated network for corosync, another for
>> drbd and a third for all traffic to/from the servers. With this, I have
>> never had a congestion-based problem.
>>
>> If possible, please paste all logs from both nodes, starting just before
>> the stonith occurred until recovery completed.
>>
>
> Hello,
>
> DRBD and CRM go over a dedicated link (two gigabit links bonded into one).
> It is never saturated or congested; it barely reaches 300 Mbps at its
> highest points. I have a separate link for traffic to/from the virtual
> machines and also a separate link for managing the nodes (just SSH and
> SNMP). I could isolate corosync on a separate link, but it would take
> some time to do.
>
> Now logs...
>
> Trouble started on November 23 at 15:14.
> Here is a log from "A" node: http://pastebin.com/yM1fqvQ6
> Node B: http://pastebin.com/nwbctcgg
>
> Node B is the one that got hit by STONITH. It got killed at 15:18:50. I
> have some trouble understanding the reasons for that.
>
> Is the reason for the STONITH that those operations took a long time
> to finish?
>
> Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[114] on XEN-piaskownica for client 9529 stayed in
> operation list for 24760 ms (longer than 10000 ms)
> Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[115] on XEN-acsystemy01 for client 9529 stayed in
> operation list for 25760 ms (longer than 10000 ms)
> Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[116] on XEN-frodo for client 9529 stayed in operation
> list for 50760 ms (longer than 10000 ms)
>
> But I wonder what made it stop those virtual machines in the first place?
> Another clue is here:
>
> Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice:
> reduce operation contention either by increasing lrmd max_children or by
> increasing intervals of monitor operations
>
> And here:
>
> coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN:
> unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on
> rivendell-B: not running (7)
>
> But why "not running"? That is not really true. There is also some
> trouble with fencing:
>
> coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
> unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on
> rivendell-A: unknown error (1)
> coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
> common_apply_stickiness: Forcing fencing-of-B away from rivendell-A
> after 1000000 failures (max=1000000)
>
> Thank you!
>
I'd like to see the full logs, starting from a little before the issue
started.
It looks, though, like a stop was called for whatever reason and failed,
so the node was fenced. This would mean that congestion, as you suggested,
is not the likely cause.
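As an aside, the "configuration advice" warning from lrmd that you quoted
can usually be acted on directly. This is only a sketch, untested here, and
the values are placeholders, but with the crm shell the operations on one
of the Xen resources from your log (XEN-piaskownica; the xmfile path is a
guess) might look something like:

  primitive XEN-piaskownica ocf:heartbeat:Xen \
        params xmfile="/etc/xen/piaskownica.cfg" \
        op monitor interval="60s" timeout="60s" \
        op stop interval="0" timeout="120s"

If memory serves, the cluster-glue lrmd's max-children can also be raised
at runtime with something like 'lrmadmin -p max-children 8', or via
LRMD_MAX_CHILDREN in the distro's pacemaker/heartbeat defaults file,
depending on the version. Longer stop timeouts and fewer concurrent
operations make it less likely that a stop times out and triggers a fence.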
Out of curiosity, though: what bonding mode are you using? My testing
showed that only mode=1 was reliable. Since I tested, corosync has added
support for mode=0 and mode=2, but I've not re-tested them. When I was
doing my bonding tests, I found that all the other modes broke
communication in some manner of use or failure/recovery testing.
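For reference, mode=1 is active-backup. Just as an illustration (interface
names and addresses are placeholders, not taken from your setup), a mode=1
bond on a Debian-style system would look roughly like this in
/etc/network/interfaces:

  auto bond0
  iface bond0 inet static
        address 10.20.0.1
        netmask 255.255.255.0
        bond-slaves eth2 eth3
        bond-mode active-backup
        bond-miimon 100
        bond-primary eth2

On Red Hat-style systems the equivalent is BONDING_OPTS="mode=1 miimon=100"
in the ifcfg-bond0 file. If your DRBD/CRM bond is mode=0 (balance-rr) or
mode=2 (balance-xor), that alone could explain corosync misbehaving under
load even when the link is far from saturated.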
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?