[ClusterLabs] DLM not working on my GFS2/pacemaker cluster
daniel at benoy.name
daniel at benoy.name
Tue Jan 19 17:50:44 UTC 2016
I've figured out the problem. Thank you everyone for the help.
It was the McAfee anti-virus that was recently installed that caused the
issue.
On 2016-01-19 10:20, daniel at benoy.name wrote:
> http://pastebin.com/qaeEAFWz
>
> On 2016-01-19 09:49, emmanuel segura wrote:
>> dlm_tool dump ?
>>
>> 2016-01-19 15:25 GMT+01:00 <daniel at benoy.name>:
>>> Yes, fencing is working, and SELinux is disabled.
>>>
>>> What configuration details do you require?
>>>
>>> Here's my corosync.conf: http://pastebin.com/SD1Gbdj0
>>> Here's my output from 'crm configure show':
>>> http://pastebin.com/eAiq2yJ9
>>>
>>> Another cluster is running fine with an identical configuration.
>>>
>>> On 2016-01-19 03:49, emmanuel segura wrote:
>>>>
>>>> please share your cluster config and say if your fencing is working.
>>>>
>>>> 2016-01-19 3:47 GMT+01:00 <daniel at benoy.name>:
>>>>>
>>>>> One of my clusters is having a problem. It's no longer able to set
>>>>> up its
>>>>> GFS2 mounts. I've narrowed the problem down a bit. Here's the
>>>>> output when
>>>>> I
>>>>> try to start the DLM daemon (Normally this is something
>>>>> corosync/pacemaker
>>>>> starts up for me, but here it is on the command line for the debug
>>>>> output):
>>>>> # dlm_controld -D -q 04561 dlm_controld 4.0.1 started
>>>>> 4561 our_nodeid 168528918
>>>>> 4561 found /dev/misc/dlm-control minor 56
>>>>> 4561 found /dev/misc/dlm-monitor minor 55
>>>>> 4561 found /dev/misc/dlm_plock minor 54
>>>>> 4561 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>>>>> 4561 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>>>>> 4561 cmap totem.rrp_mode = 'none'
>>>>> 4561 set protocol 0
>>>>> 4561 set recover_callbacks 1
>>>>> 4561 cmap totem.cluster_name = 'cwwba'
>>>>> 4561 set cluster_name cwwba
>>>>> 4561 /dev/misc/dlm-monitor fd 11
>>>>> 4561 cluster quorum 1 seq 672 nodes 2
>>>>> 4561 cluster node 168528918 added seq 672
>>>>> 4561 set_configfs_node 168528918 10.11.140.22 local 1
>>>>> 4561 /sys/kernel/config/dlm/cluster/comms/168528918/addr: open
>>>>> failed:
>>>>> 1
>>>>> 4561 cluster node 168528919 added seq 672
>>>>> 4561 set_configfs_node 168528919 10.11.140.23 local 0
>>>>> 4561 /sys/kernel/config/dlm/cluster/comms/168528919/addr: open
>>>>> failed:
>>>>> 1
>>>>> 4561 cpg_join dlm:controld ...
>>>>> 4561 setup_cpg_daemon 13
>>>>> 4561 dlm:controld conf 1 1 0 memb 168528918 join 168528918 left
>>>>> 4561 daemon joined 168528918
>>>>> 4561 fence work wait for cluster ringid
>>>>> 4561 dlm:controld ring 168528918:672 2 memb 168528918 168528919
>>>>> 4561 fence_in_progress_unknown 0 startup
>>>>> 4561 receive_protocol 168528918 max 3.1.1.0 run 0.0.0.0
>>>>> 4561 daemon node 168528918 prot max 0.0.0.0 run 0.0.0.0
>>>>> 4561 daemon node 168528918 save max 3.1.1.0 run 0.0.0.0
>>>>> 4561 set_protocol member_count 1 propose daemon 3.1.1 kernel
>>>>> 1.1.1
>>>>> 4561 receive_protocol 168528918 max 3.1.1.0 run 3.1.1.0
>>>>> 4561 daemon node 168528918 prot max 3.1.1.0 run 0.0.0.0
>>>>> 4561 daemon node 168528918 save max 3.1.1.0 run 3.1.1.0
>>>>> 4561 run protocol from nodeid 168528918
>>>>> 4561 daemon run 3.1.1 max 3.1.1 kernel run 1.1.1 max 1.1.1
>>>>> 4561 plocks 14
>>>>> 4561 receive_protocol 168528918 max 3.1.1.0 run 3.1.1.0
>>>>>
>>>>> As you can see, it's trying to configure the node addresses, but
>>>>> it's
>>>>> unable
>>>>> to write to the 'addr' file under the /sys/kernel/config configfs
>>>>> tree
>>>>> (See
>>>>> the 'open failed' lines above). I have no idea why. dmesg isn't
>>>>> saying
>>>>> anything. Nothing is telling me why it doesn't want me writing
>>>>> there. And
>>>>> I
>>>>> can confirm this behavior on the prompt as well.
>>>>>
>>>>> Trying to start CLVM results in complaints about the node not
>>>>> having an
>>>>> address set, which makes sense given the
>>>>>
>>>>> Here's the exact same command run twice. First, on a very similarly
>>>>> configured cluster (which is currently running):
>>>>> # cat /sys/kernel/config/dlm/cluster/comms/169446438/addrcat
>>>>> cat: /sys/kernel/config/dlm/cluster/comms/169446438/addr:
>>>>> Permission
>>>>> denied
>>>>> (That's what I expect to see. It's a write-only file.)
>>>>>
>>>>> And now on this messed up cluster:
>>>>> # cat /sys/kernel/config/dlm/cluster/comms/168528918/addr
>>>>> cat: /sys/kernel/config/dlm/cluster/comms/168528918/addr:
>>>>> Operation not
>>>>> permitted
>>>>>
>>>>> Why 'operation not permitted'? dmesg isn't telling me anything at
>>>>> all,
>>>>> and I
>>>>> don't see any way to get the kernel to spit out some kind of
>>>>> explanation
>>>>> for
>>>>> why it's blocking me. Can anyone help? At least point me in a
>>>>> direction
>>>>> where I can get the system to give me some indication why it's
>>>>> behaving
>>>>> this
>>>>> way?
>>>>>
>>>>> I'm running Ubuntu 14.04, and I've posted this on the Ubuntu forums
>>>>> as
>>>>> well:
>>>>> http://ubuntuforums.org/showthread.php?t=2310383
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list: Users at clusterlabs.org
>>>>> http://clusterlabs.org/mailman/listinfo/users
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started:
>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
More information about the Users
mailing list