[Pacemaker] Re: Frequent SBD triggered server reboots

Tue May 7 07:59:49 UTC 2013

Hi,

Here are three logs from the last watchdog-driven server reboot on Friday
evening, with the SBD watchdog timeout set to 20 seconds. I don't expect you
to actually dig into them; it's just to update this thread with my new
findings.

1) sar.txt is the output of sar -d -p 2 (disk statistics, pretty-printed,
sampled every two seconds), starting right before the reboot

2) messages.txt is an extract of the server's /var/log/messages starting
right before the reboot, with verbose logging enabled for the QLogic driver,
the SCSI layer and SBD

3) cpu1.txt is the output of sar -P ALL 2 (CPU statistics sampled every two
seconds), filtered by CPU #1, starting right before the reboot

sda is the local drive, sdb and sdc are the same single SAN LUN as seen by
the two FC ports of the server, san is the LUN multipath alias, san_part1 is
the SBD partition, san_part2 is the Oracle partition.

sar.txt shows that somewhere between 17:46:44 and 17:46:46 all reads and
writes to/from the san LUN drop to zero, for both the SBD and Oracle
partitions, right until the 17th second of the SBD countdown, at which time
something (3.88 wr/s) seems to get written to the Oracle partition.
%util jumps to 100%, and so does %iowait (from cpu1.txt) on 3 of the 24 CPU
cores this server has (the ones Oracle and SBD were using at the time, I
suppose).
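
The pattern described above (no reads or writes on a device while %util is
pinned at 100%) could be checked mechanically along the following lines. This
is only a rough Python sketch: the Sample tuple, the find_stall name, the
99% threshold and the sample values are all assumptions made for
illustration. The values are NOT taken from sar.txt, and parsing real sar -d
output is left out.

```python
from typing import NamedTuple, Optional, Sequence


class Sample(NamedTuple):
    """One pre-parsed disk sample for one device (hypothetical layout)."""
    time: str      # HH:MM:SS of the sample
    device: str    # e.g. "san_part1"
    reads: float   # read rate for the interval
    writes: float  # write rate for the interval
    util: float    # %util for the interval


def find_stall(samples: Sequence[Sample], device: str,
               util_threshold: float = 99.0) -> Optional[str]:
    """Return the time of the first sample where `device` does no I/O
    while %util is at or above `util_threshold`, or None if there is none."""
    for s in samples:
        if s.device != device:
            continue
        if s.reads == 0.0 and s.writes == 0.0 and s.util >= util_threshold:
            return s.time
    return None


# Illustrative values only, NOT the contents of sar.txt.
samples = [
    Sample("17:46:42", "san_part1", 1.0, 1.0, 0.4),
    Sample("17:46:44", "san_part1", 1.0, 1.0, 0.4),
    Sample("17:46:46", "san_part1", 0.0, 0.0, 100.0),
    Sample("17:46:48", "san_part1", 0.0, 0.0, 100.0),
]
print(find_stall(samples, "san_part1"))
```

Comparing the returned time with the configured SBD watchdog timeout (20
seconds here) gives a rough idea of how long the stall has to last before the
watchdog fires.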

messages.txt shows at 17:46:44 this QLogic driver message, which is
different from the rest of the QLogic messages:

May  3 17:46:44 server1 kernel: [66588.156113] qla2xxx
[0000:11:00.1]-5816:2: Discard RND Frame -- 1006 02c1 0000.

Since I started facing these problems I have collected gigabytes of
/var/log/messages from these servers. The QLogic driver writes the occasional
"dropped frame(s) detected" message during normal server operations, but it
never writes this "Discard RND Frame" message unless an unwanted reboot
follows right after. After that message the kernel records no further SCSI
layer read or write traffic on sdb and sdc, except for a couple of "device
ready" commands. All this information has already been shared with the SAN
department.
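
To check whether the "Discard RND Frame" message really shows up only before
unwanted reboots, its timestamps can be pulled out of a large messages file
and compared with the known reboot times. A minimal Python sketch, assuming
the usual syslog timestamp prefix; the find_marker name is made up for
illustration, and the sample line is the one quoted above, joined onto a
single line:

```python
import re
from typing import Iterable, List

# Matches the syslog timestamp at the start of a line, e.g. "May  3 17:46:44".
TIMESTAMP = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")


def find_marker(lines: Iterable[str],
                marker: str = "Discard RND Frame") -> List[str]:
    """Return the syslog timestamps of all lines containing `marker`."""
    hits = []
    for line in lines:
        if marker in line:
            m = TIMESTAMP.match(line)
            # Keep a placeholder if the line has no recognizable timestamp.
            hits.append(m.group(1) if m else "unknown")
    return hits


sample = [
    "May  3 17:46:44 server1 kernel: [66588.156113] qla2xxx "
    "[0000:11:00.1]-5816:2: Discard RND Frame -- 1006 02c1 0000.",
]
print(find_marker(sample))
```

Since find_marker accepts any iterable of lines, an open file object for
/var/log/messages can be passed in directly, so the whole file does not have
to be loaded into memory.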

Yesterday the SAN department made a parameter configuration change on the
two Brocade switches (multipath worked smoothly on the servers, switching
paths back and forth as the respective switches got restarted). I hope this
fixes the problem; otherwise we might investigate the switch port
configuration change described in the following link, as it seems to apply
to our current configuration (8Gb FC, Brocade switches, lots of er_bad_os
port errors, fill word port mode currently set to 1, and random server
problems):

http://loopbackconnector.com/2013/02/14/brocade-8-gb-how-to-talk-when-idle-portcfgfillword/

andrea

----------------------------------------------------------------------

Message: 1
Date: Fri, 3 May 2013 10:17:12 +0200
From: Lars Marowsky-Bree <lmb at suse.com>
To: The Pacemaker cluster resource manager
	<pacemaker at oss.clusterlabs.org>
Subject: Re: [Pacemaker] Frequent SBD triggered server reboots
Message-ID: <20130503081712.GE3705 at suse.de>
Content-Type: text/plain; charset=iso-8859-1

On 2013-05-03T02:49:54, andrea cuozzo <andrea.cuozzo at sysma.it> wrote:

> Unfortunately the OS and SP version for the Oracle project these clusters
> belong to have been decided several layers over my head. I'll make a
> point of upgrading to SP2 anyway, I might get lucky.

Good luck with that!

> the SAN department investigate their side of the problem, I'll take a
> look at trying a different stonith resource, all servers involved
> have some kind of IBM management console. Thanks for your answers to
> my questions and for your time, very much appreciated.

You're missing out on many further fixes since SP1 went out of support.
Not just to sbd, but everything, from kernel to pacemaker to glibc and back.

Since support is obviously irrelevant to your management, you could consider
recompiling sbd from source if you were so inclined, though.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
HRB 21284 (AG Nürnberg) "Experience is the name everyone gives to their
mistakes." -- Oscar Wilde

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sar.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130507/e23fecaa/attachment-0009.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cpu1.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130507/e23fecaa/attachment-0010.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130507/e23fecaa/attachment-0011.txt>