[Pacemaker] 2 sbd devices and stonith-ng is showing (1 active devices)
Janec, Jozef
jozef.janec at hp.com
Fri Mar 16 07:49:10 UTC 2012
Hello,
To answer your question whether I have read that article: yes.
About the number of resources, the only statement there is:
"The sbd agent does not need to and should not be cloned. If all of your nodes run SBD, as is most likely, not even a monitor action provides a real benefit, since the daemon would suicide the node if there was a problem."
It does not say explicitly that there should be only one resource. If there is only one resource and something happens to the storage holding the sbd device, only one node will detect it, because the monitor operation runs only on the node where the resource is running.
This behavior matters to us because we had a situation where we lost one of the sbd devices and, as a result, the whole cluster was rebooted, even though the second device was still available.
And from what I have read in the documentation, the fence resource is used to send a message to the slot on the shared device:
"The sbd daemon runs on all nodes in the cluster, monitoring the shared storage. When it either loses access to the majority of sbd devices, or sees that another node has written a fencing request to its mailbox slot, the node will immediately fence itself. "
If a node has no stonith resource agent for sbd, how will that node send a fencing request to another node? Or do all nodes share the one resource? And if there is a problem with this shared resource, will there be no fence action at all, because the nodes will not be able to send a request to the shared device?
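If I understand it correctly, the slot mechanism itself can be exercised by hand with the sbd tool (device and node names below are from our cluster; as far as I know "test" is a harmless message type that the watcher only logs):

  sbd -d /dev/mapper/SHARED1_part1 list
  sbd -d /dev/mapper/SHARED1_part1 message b400ple0 test

But it is still not clear to me which node's resource is supposed to write the real "reset" message when only one external/sbd primitive exists.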
In a two-node situation where the network connection is lost, a split-brain will occur; if this single shared resource also has a problem at that moment, the situation will not be handled correctly. From my point of view, one resource for sbd is a single point of failure, whereas when each node has its own, all of those situations are avoided.
A very nice example to test:
Configure one or two sbd devices in the cluster on a device-mapper multipath target. Then use echo 1 > /sys/block/sdXXX/device/delete to delete all path devices from the multipath target. Now every I/O operation on that multipath device will hang. In this scenario the server will never realize that the sbd resource is not working on the affected node. When the second node then needs to perform an sbd operation, how will you manage that?
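Concretely, the test looks something like this (SHARED1 is one of our multipath maps; sdX and sdY stand for whatever paths "multipath -ll SHARED1" reports):

  multipath -ll SHARED1
  echo 1 > /sys/block/sdX/device/delete
  echo 1 > /sys/block/sdY/device/delete

Once the last path is gone, every I/O on /dev/mapper/SHARED1_part1 simply hangs (assuming queue_if_no_path is set on the map), so the sbd watcher on that device blocks instead of failing.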
Best regards
Jozef
> We have configured pacemaker on HAE from novell:
>
> cat /etc/sysconfig/sbd
> SBD_DEVICE="/dev/mapper/SHARED1_part1;/dev/mapper/SHARED2_part1"
> SBD_OPTS="-W"
>
> I'm running 2 instances of watcher
>
> root 9157 1 0 11:00 pts/0 00:00:00 sbd: inquisitor
> root 9158 9157 0 11:00 pts/0 00:00:00 sbd: watcher: /dev/mapper/SHARED1_part1 - slot: 0
> root 9159 9157 0 11:00 pts/0 00:00:00 sbd: watcher: /dev/mapper/SHARED2_part1 - slot: 1
That looks fine, but did you read
http://www.linux-ha.org/wiki/SBD_Fencing about the limitations of using
2 devices?
> I have running one resource per node
>
> Online: [ b300ple0 b400ple0 ]
>
> sbd_fense_b400 (stonith:external/sbd): Started b400ple0
> sbd_fense_b300 (stonith:external/sbd): Started b300ple0
Why that? Did you read http://www.linux-ha.org/wiki/SBD_Fencing?
You only need one external/sbd per cluster. A single primitive is sufficient; there is no need to run several, nor to clone them.
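Something along these lines is enough (the resource name is only an example; the device list is picked up from /etc/sysconfig/sbd):

  crm configure primitive stonith-sbd stonith:external/sbd

Pacemaker will start that single primitive on one node, but any node can still be fenced through the shared devices, since the fencing request is simply written to the target's slot.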
> Mar 17 11:03:51 b400ple0 stonith-ng: [9467]: info:
> stonith_device_register: Added 'sbd_fense_b400' to the device list (1
> active devices)
Yes, because from the point of view of stonith-ng there is only one "sbd" device, even though it internally uses two storage devices - which stonith-ng doesn't know about.
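For illustration, assuming the usual tools are installed on your nodes, you can compare the two views:

  stonith_admin -L
    (lists the fencing devices registered with stonith-ng - only the one external/sbd instance)

  sbd -d /dev/mapper/SHARED1_part1 -d /dev/mapper/SHARED2_part1 dump
    (dumps the headers of both underlying storage devices that the daemon watches)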
> I got:
>
> Mar 15 11:15:33 b300ple0 stonith-ng: [8546]: debug: exec_child_done: Got 60 more bytes: Performing: stonith -t external/sbd -S failed: 0.05859375
> Mar 15 11:15:33 b300ple0 stonith-ng: [8546]: notice: log_operation: Operation 'monitor' [15803] for device 'sbd_fense_b300' returned: 1
> Mar 15 11:15:33 b300ple0 stonith-ng: [8546]: debug: log_operation: sbd_fense_b300 output: Performing: stonith -t external/sbd -S
> Mar 15 11:15:33 b300ple0 stonith-ng: [8546]: debug: log_operation: sbd_fense_b300 output: failed: 0.05859375
> Mar 15 11:15:33 b300ple0 lrm-stonith: [15802]: debug: execra: sbd_fense_b300_monitor returned 1
> Mar 15 11:15:33 b300ple0 stonith-ng: [8546]: debug: log_operation: sbd_fense_b300 output: (total 60 bytes)
The agent itself should also have logged something.
Have you, by chance, configured one external/sbd instance per device?
That would be wrong; you need to run one external/sbd instance per cluster for all devices.
Did you read http://www.linux-ha.org/wiki/SBD_Fencing?
Regards,
Lars
--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde