[ClusterLabs] fence_scsi no such device
marvin
xzarth at gmail.com
Mon Mar 21 13:39:07 UTC 2016
On 03/15/2016 03:39 PM, Ken Gaillot wrote:
> On 03/15/2016 09:10 AM, marvin wrote:
>> Hi,
>>
>> I'm trying to get fence_scsi working, but i get "no such device" error.
>> It's a two node cluster with nodes called "node01" and "node03". The OS
>> is RHEL 7.2.
>>
>> here is some relevant info:
>>
>> # pcs status
>> Cluster name: testrhel7cluster
>> Last updated: Tue Mar 15 15:05:40 2016 Last change: Tue Mar 15
>> 14:33:39 2016 by root via cibadmin on node01
>> Stack: corosync
>> Current DC: node03 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
>> 2 nodes and 23 resources configured
>>
>> Online: [ node01 node03 ]
>>
>> Full list of resources:
>>
>> Clone Set: dlm-clone [dlm]
>> Started: [ node01 node03 ]
>> Clone Set: clvmd-clone [clvmd]
>> Started: [ node01 node03 ]
>> fence-node1 (stonith:fence_ipmilan): Started node03
>> fence-node3 (stonith:fence_ipmilan): Started node01
>> Resource Group: test_grupa
>> test_ip (ocf::heartbeat:IPaddr): Started node01
>> lv_testdbcl (ocf::heartbeat:LVM): Started node01
>> fs_testdbcl (ocf::heartbeat:Filesystem): Started node01
>> oracle11_baza (ocf::heartbeat:oracle): Started node01
>> oracle11_lsnr (ocf::heartbeat:oralsnr): Started node01
>> fence-scsi-node1 (stonith:fence_scsi): Started node03
>> fence-scsi-node3 (stonith:fence_scsi): Started node01
>>
>> PCSD Status:
>> node01: Online
>> node03: Online
>>
>> Daemon Status:
>> corosync: active/enabled
>> pacemaker: active/enabled
>> pcsd: active/enabled
>>
>> # pcs stonith show
>> fence-node1 (stonith:fence_ipmilan): Started node03
>> fence-node3 (stonith:fence_ipmilan): Started node01
>> fence-scsi-node1 (stonith:fence_scsi): Started node03
>> fence-scsi-node3 (stonith:fence_scsi): Started node01
>> Node: node01
>> Level 1 - fence-scsi-node3
>> Level 2 - fence-node3
>> Node: node03
>> Level 1 - fence-scsi-node1
>> Level 2 - fence-node1
>>
>> # pcs stonith show fence-scsi-node1 --all
>> Resource: fence-scsi-node1 (class=stonith type=fence_scsi)
>> Attributes: pcmk_host_list=node01 pcmk_monitor_action=metadata
>> pcmk_reboot_action=off
>> Meta Attrs: provides=unfencing
>> Operations: monitor interval=60s (fence-scsi-node1-monitor-interval-60s)
>>
>> # pcs stonith show fence-scsi-node3 --all
>> Resource: fence-scsi-node3 (class=stonith type=fence_scsi)
>> Attributes: pcmk_host_list=node03 pcmk_monitor_action=metadata
>> pcmk_reboot_action=off
>> Meta Attrs: provides=unfencing
>> Operations: monitor interval=60s (fence-scsi-node3-monitor-interval-60s)
>>
>> node01 # pcs stonith fence node03
>> Error: unable to fence 'node03'
>> Command failed: No such device
>>
>> node01 # tail /var/log/messages
>> Mar 15 14:54:04 node01 stonith-ng[20024]: notice: Client
>> stonith_admin.29191.2b7fe910 wants to fence (reboot) 'node03' with
>> device '(any)'
>> Mar 15 14:54:04 node01 stonith-ng[20024]: notice: Initiating remote
>> operation reboot for node03: d1df9201-5bb1-447f-9b40-d3d7235c3d0a (0)
>> Mar 15 14:54:04 node01 stonith-ng[20024]: notice: fence-scsi-node3 can
>> fence (reboot) node03: static-list
>> Mar 15 14:54:04 node01 stonith-ng[20024]: notice: fence-node3 can fence
>> (reboot) node03: static-list
>> Mar 15 14:54:04 node01 stonith-ng[20024]: notice: All fencing options
>> to fence node03 for stonith_admin.29191 at node01.d1df9201 failed
> The above line is the key. Both of the devices registered for node03
> returned failure. Pacemaker then looked for any other device capable of
> fencing node03 and there is none, so that's why it reported "No such
> device" (an admittedly obscure message).
>
> It looks like the fence agents require more configuration options than
> you have set. If you run "/path/to/fence/agent -o metadata", you can see
> the available options. It's a good idea to first get the agent running
> successfully manually on the command line ("status" command is usually
> sufficient), then put those same options in the cluster configuration.
>
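For anyone searching their own logs for the same symptom, the line Ken points to can be pulled out with a small helper. This is only a sketch: the function name fence_failures is made up for illustration, and the pattern assumes the message wording shown in the log excerpt above.

```shell
# Hypothetical helper: print stonith-ng messages reporting that every
# fencing option for a target failed. Reads the given file, or stdin
# when no file is given.
fence_failures() {
    grep -E 'stonith-ng.*All fencing options to fence .* failed' "${1:--}"
}
```

Usage would be something like `fence_failures /var/log/messages` on the node that initiated the fence.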
Made some progress, but found a new issue.
I got fence_scsi to work: it unfences at start and fences when I
tell it to.
The problem shows up when I fence, for instance, node01. Fencing stops
pacemaker but leaves corosync running, so node01 stays in the "pending"
state and node03 won't start node01's services until node01 is restarted.
The keys seem to be handled correctly.
Before fence:
# pcs status
Cluster name: testrhel7cluster
Last updated: Mon Mar 21 14:26:53 2016 Last change: Mon Mar 21
14:26:27 2016 by root via crm_resource on node01
Stack: corosync
Current DC: node01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 21 resources configured
Online: [ node01 node03 ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ node01 node03 ]
Clone Set: clvmd-clone [clvmd]
Started: [ node01 node03 ]
Resource Group: test_grupa
test_ip (ocf::heartbeat:IPaddr): Started node01
lv_testdbcl (ocf::heartbeat:LVM): Started node01
fs_testdbcl (ocf::heartbeat:Filesystem): Started node01
oracle11_baza (ocf::heartbeat:oracle): Started node01
oracle11_lsnr (ocf::heartbeat:oralsnr): Started node01
Resource Group: oracle12_test
oracle12_ip (ocf::heartbeat:IPaddr): Started node03
lv_testdbcl12 (ocf::heartbeat:LVM): Started node03
fs_testdbcl12 (ocf::heartbeat:Filesystem): Started node03
oracle12_baza (ocf::heartbeat:oracle): Started node03
oracle12_lsnr (ocf::heartbeat:oralsnr): Started node03
scsi-node03 (stonith:fence_scsi): Started node03
scsi-node01 (stonith:fence_scsi): Started node01
PCSD Status:
node01: Online
node03: Online
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
After fence:
# pcs status
Cluster name: testrhel7cluster
Last updated: Mon Mar 21 14:28:40 2016 Last change: Mon Mar 21
14:26:27 2016 by root via crm_resource on node01
Stack: corosync
Current DC: node03 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 21 resources configured
Node node01: pending
Online: [ node03 ]
Full list of resources:
Clone Set: dlm-clone [dlm]
Started: [ node03 ]
Stopped: [ node01 ]
Clone Set: clvmd-clone [clvmd]
Started: [ node03 ]
Stopped: [ node01 ]
Resource Group: test_grupa
test_ip (ocf::heartbeat:IPaddr): Stopped
lv_testdbcl (ocf::heartbeat:LVM): Stopped
fs_testdbcl (ocf::heartbeat:Filesystem): Stopped
oracle11_baza (ocf::heartbeat:oracle): Stopped
oracle11_lsnr (ocf::heartbeat:oralsnr): Stopped
Resource Group: oracle12_test
oracle12_ip (ocf::heartbeat:IPaddr): Started node03
lv_testdbcl12 (ocf::heartbeat:LVM): Started node03
fs_testdbcl12 (ocf::heartbeat:Filesystem): Started node03
oracle12_baza (ocf::heartbeat:oracle): Started node03
oracle12_lsnr (ocf::heartbeat:oralsnr): Started node03
scsi-node03 (stonith:fence_scsi): Started node03
scsi-node01 (stonith:fence_scsi): Stopped
PCSD Status:
node01: Online
node03: Online
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Note that node01 is in the pending state.
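The pending node can also be picked out of the status output mechanically. A minimal sketch, assuming the "Node <name>: pending" line format shown above; the function name pending_nodes is hypothetical:

```shell
# Hypothetical helper: print the names of nodes that pcs status reports
# as pending. Reads the given file, or stdin when no file is given.
pending_nodes() {
    awk '$1 == "Node" && $NF == "pending" { sub(/:$/, "", $2); print $2 }' "${1:--}"
}
```

For example, `pcs status | pending_nodes` would print node01 for the output above.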
The stonith config is this:
# pcs stonith show scsi-node01 --all
Resource: scsi-node01 (class=stonith type=fence_scsi)
Attributes: pcmk_host_list=node01,node03 pcmk_host_check=static-list
pcmk_monitor_action=metadata pcmk_reboot_action=off
logfile=/var/log/cluster/fence_scsi.log verbose=3
Meta Attrs: provides=unfencing
Operations: monitor interval=60s (scsi-node01-monitor-interval-60s)
# pcs stonith show scsi-node03 --all
Resource: scsi-node03 (class=stonith type=fence_scsi)
Attributes: pcmk_host_list=node01,node03 pcmk_host_check=static-list
pcmk_monitor_action=metadata pcmk_reboot_action=off
logfile=/var/log/cluster/fence_scsi.log verbose=3
Meta Attrs: provides=unfencing
Operations: monitor interval=60s (scsi-node03-monitor-interval-60s)
As soon as I restart node01 or disconnect it from the network, the
services start on node03.
Is this expected behavior, or is something weird going on here?