[Pacemaker] ocfs2_controld.pcmk process issue
Matthew O'Connor
matt at ecsorl.com
Wed May 16 02:42:37 UTC 2012
I'm sorry, no. It's on Ubuntu 11.10... I was looking into grabbing a
copy of the SUSE community DVD ISO the other night - would this come
with all the necessary packages for setting up Pacemaker/Corosync +
OCFS2? If nothing else, I'd be happy to see whether I can reproduce the
issue consistently on at least two distributions.
On 5/15/2012 8:34 PM, Andrew Beekhof wrote:
> Is this on SLES by any chance?
> SUSE are about the only ones with knowledge in this area I'm afraid.
>
> On Tue, May 15, 2012 at 6:01 AM, Matthew O'Connor <matt at ecsorl.com> wrote:
>> Hi!
>>
>> I ran into the issue of ocfs2_controld.pcmk consuming vast CPU again -
>> twice, actually. The most recent occurrence was after a multi-node
>> failure. One node stayed alive, two nodes had to be rebooted. After
>> the reboots, one of the two came back without issue and was able to
>> mount the OCFS2 stores. The second node showed high CPU usage in the
>> ocfs2_controld.pcmk process and could not mount the OCFS2 stores. The
>> logs were being voraciously filled with the following message:
>>
>> ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
>>
>> This message was being output so frequently that syslogd was starting to
>> rate-limit it. I suspect this accounts for the high CPU usage. After
>> restarting the troubled node several times, I found the solution was to
>> order the OCFS2/DLM resource group to stop, cluster-wide, and then
>> restart it. Normal behavior followed. (In a prior post to the list, I
>> referenced hard-killing the ocfs2_controld.pcmk process. This was a
>> more graceful shutdown.)
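
A minimal sketch of that stop-then-start sequence, assuming the crm
shell is available and that the OCFS2/DLM group is wrapped in a clone
called cl_ocfs2 (a hypothetical name, not the one from this cluster).
The wait step is left to the caller because "crm resource stop" only
requests the stop; how to confirm that every node has actually stopped
the group is not something this sketch tries to decide.

```python
import subprocess


def restart_group(resource, wait_until_stopped, run=subprocess.run):
    """Ask the cluster to stop `resource`, wait, then ask it to start again.

    `resource` is a hypothetical resource/clone name.
    `wait_until_stopped` is a caller-supplied callable that blocks until
    the resource is confirmed stopped on all nodes (for example by
    polling the cluster status by hand or with a script).
    `run` defaults to subprocess.run and can be swapped for a dry run.
    """
    run(["crm", "resource", "stop", resource], check=True)
    wait_until_stopped()
    run(["crm", "resource", "start", resource], check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def show(cmd, check):
        print(" ".join(cmd))

    restart_group("cl_ocfs2", lambda: print("(wait for stop)"), run=show)
```

Running the file as-is only prints the two commands, so it is safe to
try on a machine without a cluster.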
>>
>> Attached are two strace outputs. I'm sorry, I'm not very familiar with
>> strace, so the value of these files may be questionable. If there is
>> anything else I can provide the next time this happens, I'd be happy to
>> do so! The log-f.txt file was generated with the -f option, and the
>> log-fc.txt file was generated with -f -c.
>>
>> Here also is a snippet from the syslog, during the cluster-wide shutdown
>> of the OCFS2/DLM group:
>>
>> May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
>> May 14 15:22:14 ocfs2_controld: last message repeated 199 times
>> May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk
>> May 14 15:22:16 gw05 dlm_controld.pcmk: [3411]: notice: terminate_ais_connection: Disconnecting from AIS
>> May 14 15:22:16 gw05 lrmd: [2993]: info: RA output: (p_dlm:2:stop:stderr) dlm_controld.pcmk: no process found
>> May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
>> May 14 15:22:20 ocfs2_controld: last message repeated 199 times
>> May 14 15:22:25 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
>> May 14 15:22:26 ocfs2_controld: last message repeated 199 times
>> May 14 15:22:31 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
>> May 14 15:22:32 ocfs2_controld: last message repeated 199 times
>> May 14 15:22:37 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
>> May 14 15:22:38 ocfs2_controld: last message repeated 199 times
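
To put a number on how often the message is logged, here is a rough
sketch that tallies the checkpoint errors in a syslog excerpt, counting
the ones folded into "last message repeated N times" summaries. It
assumes one message per line, as in the raw syslog file rather than a
mail-wrapped copy, and it assumes a repeat summary belongs to the
checkpoint error only when it directly follows one. The function name
and sample lines are illustrative.

```python
import re

# syslog's summary line for suppressed duplicates.
REPEAT_RE = re.compile(r"last message repeated (\d+) times")
# The error text reported in this thread.
ERROR_TEXT = "Unable to open checkpoint"


def count_checkpoint_errors(lines):
    """Count checkpoint errors, including suppressed repeats."""
    total = 0
    last_was_error = False
    for line in lines:
        repeat = REPEAT_RE.search(line)
        if repeat:
            # Only credit repeats that directly follow a checkpoint error.
            if last_was_error:
                total += int(repeat.group(1))
        elif ERROR_TEXT in line:
            total += 1
            last_was_error = True
        else:
            last_was_error = False
    return total


SAMPLE = [
    'May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint '
    '"ocfs2:controld": Object does not exist',
    "May 14 15:22:14 ocfs2_controld: last message repeated 199 times",
    "May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk",
    'May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint '
    '"ocfs2:controld": Object does not exist',
    "May 14 15:22:20 ocfs2_controld: last message repeated 199 times",
]

if __name__ == "__main__":
    print(count_checkpoint_errors(SAMPLE))
```

On the five sample lines this counts 400 messages (two logged errors
plus two summaries of 199 repeats each).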
>>
>> One other interesting bit of log (well, interesting to me) appeared
>> when I tried to manually mount the OCFS2 store on the afflicted
>> server:
>>
>> mount.ocfs2: Unable to access cluster service while trying to join the group
>>
>> One other note - I discovered I had not specified a monitor for either
>> the pacemaker:o2cb or the pacemaker:controld RA. Could that have
>> possibly triggered this issue?
>>
>> --
>>
>> Sincerely,
>> Matthew O'Connor
>>
>> -----------------------------------------------------------------
>> Sr. Software Engineer
>> PGP/GPG Key: 0x55F981C4
>> Fingerprint: E5DC A0F8 5A40 E4DA 2CE6 B5A2 014C 2CBF 55F9 81C4
>>
>> Engineering and Computer Simulations, Inc.
>> 11825 High Tech Ave Suite 250
>> Orlando, FL 32817
>>
>> Tel: 407-823-9991 x315
>> Fax: 407-823-8299
>> Email: matt at ecsorl.com
>> Web: www.ecsorl.com
>> -----------------------------------------------------------------
>>
>> CONFIDENTIAL NOTICE: The information contained in this electronic
>> message is legally privileged, confidential and exempt from disclosure
>> under applicable law. It is intended only for the use of the individual
>> or entity named above. If the reader of this message is not the intended
>> recipient, you are hereby notified that any dissemination, distribution
>> or copying of this message is strictly prohibited. If you have received
>> this communication in error, please notify the sender immediately by
>> return e-mail and delete the original message and any copies of it from
>> your computer system. Thank you.
>>
--
Sincerely,
Matthew O'Connor