[Pacemaker] Pacemaker very often STONITHs other node

Michał Margula alchemyx at uznam.net.pl
Mon Nov 25 06:40:54 EST 2013


Hello!

I would like to ask for your help because we are having a lot of 
trouble with a cluster based on Pacemaker.

We have two identical nodes: PowerEdge R510 with 2x Xeon X5650, 64 GB 
of RAM and a MegaRAID SAS 2108 RAID controller (PERC H700). The system 
disk is a RAID 1 on SSDs (SSDSC2CW060A3), and there are two data 
volumes: one RAID 1 with WD3000FYYZ and one RAID 1 with WD1002FBYS, 
both Western Digital disks. The nodes are linked by two direct gigabit 
fiber links (no switch in between).

We have two DRBD volumes: /dev/drbd1 (1TB on WD1002FBYS disks) and 
/dev/drbd2 (3TB on WD3000FYYZ disks). On top of DRBD (used as PVs) we 
have LVM with LVs for the virtual machines, which run under Xen.

Here is our CRM configuration - http://pastebin.com/raqsvRTA

We previously used fast USB drives instead of SSDs for the root 
filesystem and that caused some trouble: I/O was lagging, so one node 
"thought" that the other one was having trouble and performed STONITH 
on it. After replacing the USB drives with SSDs we had no more trouble 
with that issue.

But now from time to time one of the nodes gets STONITHed, and the 
reason is unclear to us.

For example, last time we found this in the logs:

Nov 23 15:14:24 rivendell-B crmd: [9529]: info: process_lrm_event: LRM 
operation primitive-LVM:1_monitor_120000 (call=54, rc=7, cib-update=124, 
confirmed=false) not running

After that, node rivendell-B got STONITHed. Previously we had trouble 
with DRBD: a node stopped DRBD for no apparent reason and the result 
was again STONITH. Unfortunately we did not check the logs that time.
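
For reference, rc=7 in the log line quoted above is the standard OCF 
return code OCF_NOT_RUNNING, and the operation name suggests a 
recurring monitor with a 120000 ms interval on the LVM resource. Below 
is a minimal Python sketch for pulling those fields out of such lines 
when searching the logs. The function name parse_lrm_event and the 
regular expression are illustrative assumptions based only on the one 
line shown here; they are not part of Pacemaker.

```python
import re

# Standard OCF resource agent return codes.
OCF_RC = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}

# Hypothetical pattern, modelled on the single crmd line quoted above.
LRM_EVENT = re.compile(
    r"LRM\s+operation\s+(?P<op>\S+)\s+\(call=(?P<call>\d+),\s*rc=(?P<rc>\d+)"
)


def parse_lrm_event(line):
    """Return (operation, call id, rc, rc name), or None if no match."""
    m = LRM_EVENT.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return m.group("op"), int(m.group("call")), rc, OCF_RC.get(rc, "UNKNOWN")


# The log entry from above, joined back into one line.
line = (
    "Nov 23 15:14:24 rivendell-B crmd: [9529]: info: process_lrm_event: LRM "
    "operation primitive-LVM:1_monitor_120000 (call=54, rc=7, cib-update=124, "
    "confirmed=false) not running"
)
print(parse_lrm_event(line))
```

Running the same function over the whole log around the time of a 
fencing event would show which monitor operations failed first.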

Also, doing some tasks on one of the nodes (for example "crm resource 
migrate" of a few Xen virtual machines) can cause STONITH as well.

Could you give us some hints? Maybe our configuration is wrong? To be 
honest, we had no previous experience with HA clusters, so we created 
it based on configuration.

It has been working for over a year now, but it is giving us headaches 
and we are wondering if we should drop Pacemaker and use something else 
(even manually stopping and starting virtual machines comes to mind).

Thank you in advance!

-- 
Michał Margula, alchemyx at uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]



