[Pacemaker] node1 fencing itself after node2 being fenced

Asgaroth lists at blueface.com
Mon Feb 10 10:54:45 EST 2014



-----Original Message-----
From: Vladislav Bogdanov [mailto:bubble at hoster-ok.com] 
Sent: 10 February 2014 13:27
To: pacemaker at oss.clusterlabs.org
Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced


I cannot really recall whether it hangs or returns an error for that (I moved to
corosync2 long ago).

Are you running corosync2 on RHEL7 beta? Are we able to run corosync2 on
CentOS 6/RHEL 6?

Anyway, you probably want to run clvmd with debugging enabled.
IIRC you have two choices here: either stop the running instance
first and then run it in the console with -f -d1, or run clvmd -C -d2 to ask
all running instances to start debug logging to syslog.
I prefer the first one, because modern syslogs do rate-limiting.
You'd also need to run the lvm commands with debugging enabled.
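
For reference, the two debugging options described above can be written down
as a small helper. This is only a sketch: the function name and the mode
labels are hypothetical, and the only clvmd flags used are the ones quoted
in this thread (-f -d1 and -C -d2). It builds the command line and does not
start anything itself.

```python
def clvmd_debug_argv(mode):
    """Return the clvmd command line for one of the two debug options.

    "foreground": run clvmd in the console (stop the running instance first).
    "cluster":    ask all running clvmd instances to log debug output to syslog.
    The mode names are made up for this sketch; the flags come from the thread.
    """
    if mode == "foreground":
        return ["clvmd", "-f", "-d1"]
    if mode == "cluster":
        return ["clvmd", "-C", "-d2"]
    raise ValueError("unknown mode: %r" % (mode,))


if __name__ == "__main__":
    # Print the commands instead of executing them, so this is safe to run
    # on a machine that has no cluster stack installed.
    for mode in ("foreground", "cluster"):
        print(mode + ": " + " ".join(clvmd_debug_argv(mode)))
```

The foreground variant avoids syslog rate-limiting, at the cost of having to
stop the running daemon first.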

Thanks for this tip. I have modified clvmd to run in debug mode ("clvmd -T60
-d 2 -I cman"), and I notice that when node2 reboots there are no log entries
showing clvmd attempting to start, so something appears to be wrong with
clvmd itself. However, I did try to manually stop and start clvmd on node2
after a reboot, and these were the errors logged:

Feb 10 12:37:08 test02 kernel: dlm: connecting to 1 sctp association 2
Feb 10 12:38:00 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:38:00 test02 clvmd[2118]: Unable to create DLM lockspace for CLVM: Address already in use
Feb 10 12:38:00 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:38:00 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:39:37 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:39:37 test02 clvmd[2137]: Unable to create DLM lockspace for CLVM: Address already in use
Feb 10 12:39:37 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:39:37 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:47:21 test02 clvmd[2159]: Unable to create DLM lockspace for CLVM: Address already in use
Feb 10 12:47:21 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:47:21 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:47:21 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:48:14 test02 kernel: dlm: closing connection to node 2
Feb 10 12:48:14 test02 kernel: dlm: closing connection to node 1
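
As a reading aid for the log above: the "-98" in the kernel's "cannot start
dlm lowcomms" line is a negated Linux errno, and 98 is EADDRINUSE, the same
"Address already in use" that clvmd reports. A minimal sketch of that
mapping follows; the helper name and the small lookup table are my own, and
the table assumes Linux errno numbering.

```python
import re

# Linux errno numbering (assumption: the nodes run Linux, where EADDRINUSE
# is 98; other platforms use different numbers). Only the code seen in the
# log above is listed.
LINUX_ERRNO = {98: ("EADDRINUSE", "Address already in use")}

LOWCOMMS_RE = re.compile(r"dlm: cannot start dlm lowcomms -(\d+)")


def decode_lowcomms_error(line):
    """Return (errno name, message) for a 'cannot start dlm lowcomms' line.

    Returns None when the line is not such a message.
    """
    match = LOWCOMMS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return LINUX_ERRNO.get(code, ("errno %d" % code, "not in table"))


if __name__ == "__main__":
    sample = "Feb 10 12:38:00 test02 kernel: dlm: cannot start dlm lowcomms -98"
    print(decode_lowcomms_error(sample))
```

So the kernel message and the clvmd message describe one failure: the bind
to port 21064 is refused because the address is already in use.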

So it appears that the issue is with clvmd attempting to communicate with,
I presume, dlm. I did some searching on this error and found what I recall
was a bug report from around 2004, which was fixed, so I cannot see why this
error is cropping up now. Another oddity is that if I reboot node2 a couple
of times, clvmd may start up properly there and things then appear to work.
However, while node2 is "down", clvmd on node1 remains in a "hung" state even
though dlm appears to think everything is fine. Have you come across this
issue before?

Thanks for your assistance thus far, I appreciate it.




