[Pacemaker] node1 fencing itself after node2 being fenced

Mon Feb 10 22:43:37 EST 2014

10.02.2014 18:54, Asgaroth wrote:
> 
> 
> -----Original Message-----
> From: Vladislav Bogdanov [mailto:bubble at hoster-ok.com] 
> Sent: 10 February 2014 13:27
> To: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
> 
> 
> I cannot really recall if it hangs or returns error for that (I moved to
> corosync2 long ago).
> 
> Are you running corosync2 on RHEL7 beta? Are we able to run corosync2 on
> CentOS 6/RHEL 6?

Nope, it's CentOS 6. In short, it is probably safer for you to stay
with cman, especially if you need GFS2. gfs_controld is not officially
ported to corosync2 and is obsolete in EL7, because communication between
gfs2 and dlm has moved to kernel space there.

> 
> Anyway, you probably want to run clvmd with debugging enabled.
> IIRC you have two choices here: either stop the running instance
> first and then run it in the console with -f -d1, or run clvmd -C -d2 to ask
> all running instances to start debug logging to syslog.
> I prefer the first one, because modern syslogs do rate-limiting.
> And you'd need to run lvm commands with debugging enabled too.
> 
> Thanks for this tip, I have modified clvmd to run in debug mode ("clvmd -T60
> -d 2 -I cman") and I notice that on node2 reboot, I don't see any logs for
> clvmd actually attempting to start, so it appears there is something wrong
> here with clvmd. However, I did try to manually stop/start clvmd on node2

You need to fix that for sure.

> after a reboot and these were the error logs reported:
> 
> Feb 10 12:37:08 test02 kernel: dlm: connecting to 1 sctp association 2
> Feb 10 12:38:00 test02 kernel: dlm: Using SCTP for communications
> Feb 10 12:38:00 test02 clvmd[2118]: Unable to create DLM lockspace for CLVM:
> Address already in use
> Feb 10 12:38:00 test02 kernel: dlm: Can't bind to port 21064 addr number 1
> Feb 10 12:38:00 test02 kernel: dlm: cannot start dlm lowcomms -98
> Feb 10 12:39:37 test02 kernel: dlm: Using SCTP for communications

Strange message, looks like something is bound to that port already.
You may want to try dlm in tcp mode btw.
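
For reference, the -98 in "cannot start dlm lowcomms -98" is a negated
Linux errno, and on Linux errno 98 is EADDRINUSE, which matches the
"Address already in use" text clvmd prints. A minimal sketch for decoding
such lines from a saved log follows; the function name and the regular
expression are my own illustration, not part of dlm or clvmd, and the
symbolic names it prints depend on the platform it runs on.

```python
import errno
import os
import re

# Matches kernel lines such as "dlm: cannot start dlm lowcomms -98".
# The pattern is an assumption based on the log excerpt in this thread.
LOWCOMMS_RE = re.compile(r"dlm: cannot start dlm lowcomms -(\d+)")


def decode_lowcomms_errors(log_lines):
    """Return (errno number, symbolic name, message) for each matching line."""
    decoded = []
    for line in log_lines:
        match = LOWCOMMS_RE.search(line)
        if match:
            code = int(match.group(1))
            # errno.errorcode maps numbers to names for the local platform.
            name = errno.errorcode.get(code, "UNKNOWN")
            decoded.append((code, name, os.strerror(code)))
    return decoded


# Two lines copied from the log quoted in this message.
sample = [
    "Feb 10 12:38:00 test02 kernel: dlm: Can't bind to port 21064 addr number 1",
    "Feb 10 12:38:00 test02 kernel: dlm: cannot start dlm lowcomms -98",
]

for code, name, message in decode_lowcomms_errors(sample):
    print(code, name, message)
```

This only confirms what the error means; it does not tell you what is
holding port 21064.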

> Feb 10 12:39:37 test02 clvmd[2137]: Unable to create DLM lockspace for CLVM:
> Address already in use
> Feb 10 12:39:37 test02 kernel: dlm: Can't bind to port 21064 addr number 1
> Feb 10 12:39:37 test02 kernel: dlm: cannot start dlm lowcomms -98
> Feb 10 12:47:21 test02 clvmd[2159]: Unable to create DLM lockspace for CLVM:
> Address already in use
> Feb 10 12:47:21 test02 kernel: dlm: Using SCTP for communications
> Feb 10 12:47:21 test02 kernel: dlm: Can't bind to port 21064 addr number 1
> Feb 10 12:47:21 test02 kernel: dlm: cannot start dlm lowcomms -98
> Feb 10 12:48:14 test02 kernel: dlm: closing connection to node 2
> Feb 10 12:48:14 test02 kernel: dlm: closing connection to node 1
> 
> So it appears that the issue is with clvmd attempting to communicate with,
> I presume, dlm. I tried to do some searching on this error and it appears
> there is a bug report, if I recall correctly, around 2004, which was fixed,
> so I cannot see why this error is cropping up. Some other strangeness is,
> that if I reboot the node a couple times, it may start up properly on 2nd
> node and then things appear to work properly, however, while node 2 is
> "down" the clvmd on node1 is still in a "hung" state even though dlm appears
> to think everything is good. Have you come across this issue before?
> 
> Thanks for your assistance thus far, I appreciate it.