[Pacemaker] Cluster Volume Group is stuck

Karl Rößmann K.Roessmann at fkf.mpg.de
Thu May 12 12:28:40 EDT 2011


Thank you, that was the solution. Our stonith-timeout is now 160s;
our SBD timeouts are still:
     Timeout (watchdog) : 60
     Timeout (msgwait)  : 120
Yes, they are long, to avoid any problems with the multipath driver.
We found similar recommended values in the latest SuSE SLES HA Guide.
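
(For reference, those SBD timeouts live in the device header and are set
when the device is initialized; roughly, assuming sbd's create options on
SLE HA:

   sbd -d /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 \
       -1 60 -4 120 create

where -1 is the watchdog timeout and -4 the msgwait timeout; "create"
re-initializes the device header, so this is only done when setting up the
device.)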

Karl
> On 2011-05-12T15:16:52, Karl Rößmann <K.Roessmann at fkf.mpg.de> wrote:
>
>> This is an Update to my last Mail:
>>
>> SBD is running on one Node normally:
>
> I didn't mean to inquire wrt the external/sbd fencing agent, but the
> system daemon "sbd" - as configured via /etc/sysconfig/sbd and started
> (automatically) via /etc/init.d/openais (on SLE HA).
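>
> (For illustration, the daemon's configuration in /etc/sysconfig/sbd is
> typically just the device it should watch; the path below is the one from
> the dump output in this thread, and the variable name assumes SLE HA:
>
>   SBD_DEVICE="/dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1"
>
> The openais init script then starts sbd on that device at cluster start.)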
>
>> but after powering off Node multix246, it is running on two nodes:
>>
>>
>> Node multix246: UNCLEAN (offline)
>> Online: [ multix244 multix245 ]
>>
>>  Clone Set: dlm_clone [dlm]
>>      Started: [ multix244 multix245 ]
>>      Stopped: [ dlm:2 ]
>>  Clone Set: clvm_clone [clvm]
>>      Started: [ multix244 multix245 ]
>>      Stopped: [ clvm:2 ]
>>  Clone Set: vgsmet_clone [vgsmet]
>>      Started: [ multix244 multix245 ]
>>      Stopped: [ vgsmet:2 ]
>>  smetserv       (ocf::heartbeat:Xen):   Started multix244
>>  SBD_Stonith    (stonith:external/sbd) Started [ multix245 multix246 ]  <-----
>
> That is normal. It was running on the x246 node previously, but to fence
> said node, it needs to be started in the local partition.
>
> Normally, a few seconds later, the fence should complete and multix246
> should change state to "OFFLINE". The state you see above is only
> transient.
>
> If it remains stuck in this state for much longer, I would assume the
> fence targeting multix246 isn't actually completing; do you see the
> fence/stonith request being issued in the logs? Is there an error
> message from sbd on multix245 in the above scenario?
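>
> (One way to check, assuming the cluster logs to the default
> /var/log/messages on SLE HA:
>
>   grep -iE 'stonith|sbd' /var/log/messages
>
> should show the fence request being issued and whether it succeeded.)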
>
> Ah! Got it -
>
> From your other mail, seeing that
>
>> sbd -d /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 list
>> 0       multix244       clear
>> 1       multix245       clear
>> 2       multix246       reset   multix245
>
> suggests that multix246 was actually sent the request, and thus should
> be considered 'fenced' by the remaining cluster.
>
> Looking further back in your mails:
>
>>> /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 dump
>>> Header version     : 2
>>> Number of slots    : 255
>>> Sector size        : 512
>>> Timeout (watchdog) : 60
>>> Timeout (allocate) : 2
>>> Timeout (loop)     : 1
>>> Timeout (msgwait)  : 120
>
> You've set extremely long timeouts for the watchdog, and in particular
> for the msgwait - this means that a fence will only be considered
> completed after 120s by sbd. At the same time, you've set
> stonith-timeout to 60s, so if the fence takes longer than that, it'll be
> considered failed.
>
> You've set up your cluster so that it can never complete a successful
> fence - congratulations! ;-)
>
> If you've got a legitimate reason for setting the msgwait timeout to
> 120s, you need to set the stonith-timeout to more than 120s; 140s, for
> example.
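>
> (A sketch of that change, assuming the crm shell:
>
>   crm configure property stonith-timeout=140s
>
> i.e. comfortably longer than the 120s msgwait.)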
>
>
> Regards,
>     Lars
>
> --
> Architect Storage/HA, OPS Engineering, Novell, Inc.
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix  
> Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
>



-- 
Karl Rößmann				Tel. +49-711-689-1657
Max-Planck-Institut FKF       		Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart				email K.Roessmann at fkf.mpg.de



