[Pacemaker] Pacemaker very often STONITHs other node

Mon Nov 25 14:44:58 UTC 2013

On 25/11/13 06:40, Michał Margula wrote:
> Hello!
> 
> I wanted to ask for your help because we are having a lot of trouble
> with a cluster based on Pacemaker.
> 
> We have two identical nodes - PowerEdge R510 with 2x Xeon X5650, 64 GB
> of RAM, MegaRAID SAS 2108 RAID (PERC H700) - system disk - RAID 1 on
> SSDs (SSDSC2CW060A3) and two volumes - one RAID 1 with WD3000FYYZ and
> one RAID 1 with WD1002FBYS -- both Western Digital disks. Both nodes are
> linked with two gigabit direct fiber links (no switch in between).
> 
> We have two DRBD volumes - /dev/drbd1 (1TB on WD1002FBYS disks) and
> /dev/drbd2 (3TB on WD3000FYYZ disks). On top of DRBD (used as PVs) we
> have a LVM with LVs for virtual machines which run under XEN.
> 
> Here is our CRM configuration - http://pastebin.com/raqsvRTA
> 
> We previously used fast USB drives instead of SSDs for the root
> filesystem and it caused some trouble - I/O lagged and one node
> "thought" that the other one was having trouble and performed
> STONITH on it. After replacing them with SSDs we had no more trouble
> with that issue.
> 
> But now from time to time one of the nodes gets STONITHed, and the
> reason is unclear to us.
> 
> For example last time we found it in logs:
> 
> Nov 23 15:14:24 rivendell-B crmd: [9529]: info: process_lrm_event: LRM
> operation primitive-LVM:1_monitor_120000 (call=54, rc=7, cib-update=124,
> confirmed=false) not running
> 
> And after that node rivendell-B got STONITH. Previously we had trouble
> with DRBD - node stopped DRBD for no apparent reason and again -
> STONITH. Unfortunately we did not check logs that time.
> 
> Also, doing some tasks on one of the nodes (for example "crm resource
> migrate" of a few XEN virtual machines) can cause STONITH as well.
> 
> Could you give us some hints? Maybe our configuration is wrong? To be
> honest we had no previous experience with HA clusters so we created it
> based on configuration.
> 
> It has been working for over a year now, but it is giving us headaches
> and we are wondering if we should drop Pacemaker and use something else
> (even manual stopping and starting of virtual machines comes to mind).
> 
> Thank you in advance!

My first thought is that the network is congested. That is a lot of
servers to have on the system. Do you isolate the corosync traffic from
the DRBD traffic, or can you?

Personally, I always set up a dedicated network for corosync, another
for DRBD and a third for all traffic to/from the servers. With this, I
have never had a congestion-based problem.

If possible, please paste all logs from both nodes, starting just before
the STONITH occurred and running until recovery completed.
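
If it helps to trim the logs down, a rough Python sketch along these
lines could pull out just the window around a fence event. It assumes
classic syslog timestamps like the "Nov 23 15:14:24" in your excerpt
(no year, English month names); the function name, the year argument
and the start/end times are placeholders of mine, not anything from
your setup.

```python
from datetime import datetime

def lines_in_window(lines, start, end, year):
    """Yield syslog lines stamped between start and end (inclusive).

    Lines without a parseable timestamp are treated as continuations
    of the previous line and kept or dropped along with it.
    """
    keep = False
    for line in lines:
        try:
            # Classic syslog stamps omit the year, so the caller supplies it.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
            keep = start <= stamp <= end
        except ValueError:
            pass  # no timestamp: inherit the decision made for the previous line
        if keep:
            yield line
```

You would call it once per node with the log file opened for reading
and a start/end a few minutes either side of the fence, then paste the
output from both nodes.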

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?