<div dir="ltr"><div>Hello Andrea,<br><br></div>I think you should consider what Lars told you (upgrade to SP2). Alternatively, you could try a different LUN for SBD and use ionice to put the sbd process in the realtime I/O class.<br>
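<br>
A minimal sketch of the ionice idea, not tested on your cluster. The helper name build_ionice_command and the PID 1234 are placeholders for illustration; the ionice options themselves (-c 1 for the realtime class, -n 0 to 7 for the priority level, -p for the target PID) are the standard ones. Setting the realtime class needs root, and it only has an effect with an I/O scheduler that implements priorities, such as CFQ.<br>

```python
def build_ionice_command(pid, priority=0):
    """Return the argv list that moves a process into the realtime I/O class.

    ionice class 1 is "realtime"; priority levels run from 0 (highest) to 7.
    """
    if priority not in range(8):
        raise ValueError("realtime priority must be between 0 and 7")
    return ["ionice", "-c", "1", "-n", str(priority), "-p", str(pid)]


if __name__ == "__main__":
    # 1234 is a placeholder PID; on a real node it could come from `pidof sbd`.
    print(" ".join(build_ionice_command(1234)))
```

The resulting list can be passed to subprocess.run as root, once for each sbd PID.<br>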

<br><br></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/5/7 andrea cuozzo <span dir="ltr">&lt;<a href="mailto:andrea.cuozzo@sysma.it" target="_blank">andrea.cuozzo@sysma.it</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi,<br>

<br>

Here are three logs from the last server watchdog-driven reboot on Friday<br>

evening (not that I want you to actually dig into them, it&#39;s just to update<br>

this thread with my new findings), with SBD watchdog timeout set to 20<br>

seconds.<br>

<br>

1) sar.txt is the output of sar -d -p 2 (pretty-printed disk statistics<br>

sampled every two seconds), starting right before the reboot<br>

<br>

2) messages.txt is an extract of the server /var/log/messages starting right<br>

before the reboot, with QLogic driver, scsi layer and SBD verbose loggings<br>

enabled<br>

<br>

3) cpu1.txt is the output of sar -P ALL 2 (CPU statistics sampled every<br>

two seconds), filtered by cpu #1, starting right before the reboot<br>

<br>

sda is the local drive, sdb and sdc are the same single SAN LUN as seen by<br>

the two FC ports of the server, san is the LUN multipath alias, san_part1 is<br>

the SBD partition, san_part2 is the Oracle partition.<br>

<br>

sar.txt shows that somewhere between 17:46:44 and 17:46:46 all reads and<br>

writes to/from the san LUN drop to zero, for both the SBD and Oracle<br>

partitions, right until the 17th second of the SBD countdown, at which time<br>

something (3.88 wr/s) seems to get written on the Oracle partition.<br>

%util jumps to 100%, as does %iowait in cpu1.txt, on 3 of the 24 CPU<br>

cores this server has got (the ones Oracle and SBD were using at the time, I<br>

suppose).<br>
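<br>
A minimal sketch of how the stall window could be located in sar.txt automatically. It assumes 24-hour timestamps and the column order time, device, tps, rd_sec/s, wr_sec/s, which may differ between sysstat versions. The function name find_stall_samples and the sample rows are invented for illustration and are not taken from the attached logs.<br>

```python
def find_stall_samples(sar_lines, device):
    """Return timestamps of samples where the device shows no reads and no writes.

    Assumes whitespace-separated columns: time, device, tps, rd_sec/s, wr_sec/s, ...
    """
    stalled = []
    for line in sar_lines:
        fields = line.split()
        # Skip headers, blank lines, and rows for other devices.
        if fields[1:2] != [device]:
            continue
        try:
            reads, writes = float(fields[3]), float(fields[4])
        except (IndexError, ValueError):
            continue
        if reads == 0.0 and writes == 0.0:
            stalled.append(fields[0])
    return stalled


# Invented sample rows, only to show the assumed layout.
SAMPLE = """\
17:46:42          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
17:46:42    san_part1      2.00      8.00      1.00      4.50      0.00      0.25      0.25      0.05
17:46:44    san_part1      2.00      8.00      1.00      4.50      0.00      0.25      0.25      0.05
17:46:46    san_part1      0.00      0.00      0.00      0.00      1.00      0.00      0.00    100.00
17:46:48    san_part1      0.00      0.00      0.00      0.00      1.00      0.00      0.00    100.00
"""

if __name__ == "__main__":
    print(find_stall_samples(SAMPLE.splitlines(), "san_part1"))
```

Running the same function for san_part1 and san_part2 would show whether both partitions stall in the same samples.<br>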

<br>

messages.txt shows at 17:46:44 this QLogic driver message, which is different<br>

from the rest of the QLogic messages:<br>

<br>

May  3 17:46:44 server1 kernel: [66588.156113] qla2xxx<br>

[0000:11:00.1]-5816:2: Discard RND Frame -- 1006 02c1 0000.<br>

<br>

Since I started facing these problems I have collected gigabytes of /var/log/messages<br>

from these servers. The QLogic driver logs an occasional &quot;dropped<br>

frame(s) detected&quot; during normal server operations, but it<br>

never logs this &quot;Discard RND Frame&quot; message unless an<br>

unwanted reboot follows right after. No SCSI-layer read or write traffic<br>

on sdb and sdc gets recorded by the kernel afterwards, except for a couple<br>

of &quot;device ready&quot; commands. All this information has already been shared with the SAN<br>

department.<br>

<br>

Yesterday the SAN department made a parameter configuration change on<br>

the two Brocade switches (multipath worked smoothly on the servers,<br>

switching paths back and forth as the respective switches were restarted). I<br>

hope this fixes the problem; otherwise we might investigate the switch port<br>

configuration change described in the following link, as it seems to apply to our current<br>

configuration (8Gb FC, Brocade switches, lots of er_bad_os<br>

port errors, fill word port mode currently set to 1, and random server<br>

problems):<br>

<br>

<a href="http://loopbackconnector.com/2013/02/14/brocade-8-gb-how-to-talk-when-idle-portcfgfillword/" target="_blank">http://loopbackconnector.com/2013/02/14/brocade-8-gb-how-to-talk-when-idle-portcfgfillword/</a><br>

<br>

andrea<br>

<br>

<br>

----------------------------------------------------------------------<br>

<br>

Message: 1<br>

Date: Fri, 3 May 2013 10:17:12 +0200<br>

From: Lars Marowsky-Bree &lt;<a href="mailto:lmb@suse.com">lmb@suse.com</a>&gt;<br>

<div class="im">To: The Pacemaker cluster resource manager<br>

        &lt;<a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a>&gt;<br>

Subject: Re: [Pacemaker] Frequent SBD triggered server reboots<br>

</div>Message-ID: &lt;<a href="mailto:20130503081712.GE3705@suse.de">20130503081712.GE3705@suse.de</a>&gt;<br>

Content-Type: text/plain; charset=iso-8859-1<br>

<div class="im"><br>

On 2013-05-03T02:49:54, andrea cuozzo &lt;<a href="mailto:andrea.cuozzo@sysma.it">andrea.cuozzo@sysma.it</a>&gt; wrote:<br>

<br>

&gt; Unfortunately the OS and SP version for the Oracle project these clusters<br>

&gt; belong to were decided several layers over my head. I&#39;ll make a<br>

&gt; point of pushing for the SP2 upgrade anyway; I might get lucky.<br>

<br>

Good luck with that!<br>

<br>

&gt; the SAN department investigate their side of the problem, I&#39;ll take a<br>

&gt; look at trying a different stonith resource, all servers involved<br>

&gt; have some kind of IBM management console. Thanks for your answers to<br>

&gt; my questions and for your time, very much appreciated.<br>

<br>

You&#39;re missing out on many further fixes since SP1 went out of support.<br>

Not just to sbd, but everything, from kernel to pacemaker to glibc and back.<br>

<br>

Since support is obviously irrelevant to your management, you could consider<br>

recompiling sbd from source if you were so inclined, though.<br>

<br>

<br>

<br>

Regards,<br>

    Lars<br>

<br>

--<br>

Architect Storage/HA<br>

</div><div class="im">SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,<br>

</div>HRB 21284 (AG Nürnberg) &quot;Experience is the name everyone gives to their<br>

mistakes.&quot; -- Oscar Wilde<br>

<br>

<br>

<br>

<br>

<br>_______________________________________________<br>

Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>

<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>

<br></blockquote></div><br><br clear="all"><br>-- <br>this is my life and I live it for as long as God wills

</div>