[Pacemaker] ocfs2_controld.pcmk process issue
Andrew Beekhof
andrew at beekhof.net
Wed May 16 00:34:39 UTC 2012
Is this on SLES by any chance?
SUSE are about the only ones with knowledge in this area, I'm afraid.
On Tue, May 15, 2012 at 6:01 AM, Matthew O'Connor <matt at ecsorl.com> wrote:
> Hi!
>
> I ran into the issue of ocfs2_controld.pcmk consuming vast amounts of
> CPU again (twice, actually). The most recent occurrence was after a
> multi-node failure. One node stayed alive; two nodes had to be rebooted.
> After the reboots, one of the two came back without issue and was able
> to mount the OCFS2 stores. The second node exhibited high CPU usage in
> the ocfs2_controld.pcmk process and could not mount the OCFS2 stores.
> The logs were rapidly filling with the following message:
>
> ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object
> does not exist
>
> This message was being output so frequently that syslogd was starting to
> rate-limit it. I suspect this accounts for the high CPU usage. After
> restarting the troubled node several times, I found the solution was to
> order the OCFS2/DLM resource group to stop, cluster-wide, and then
> restart it. Normal behavior followed. (In a prior post to the list, I
> referenced hard-killing the ocfs2_controld.pcmk process. This was a
> more graceful shutdown.)
>
> Attached are two strace outputs. I'm sorry, but I'm not very familiar
> with strace, so the value of these files may be questionable. If there is
> anything else I can provide the next time this happens, I'd be happy to
> do so! The log-f.txt file was generated with the -f option, and the
> log-fc.txt file was generated with -f -c.
>
> Here also is a snippet from the syslog, during the cluster-wide shutdown
> of the OCFS2/DLM group:
>
> May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint
> "ocfs2:controld": Object does not exist
> May 14 15:22:14 ocfs2_controld: last message repeated 199 times
> May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk
> May 14 15:22:16 gw05 dlm_controld.pcmk: [3411]: notice:
> terminate_ais_connection: Disconnecting from AIS
> May 14 15:22:16 gw05 lrmd: [2993]: info: RA output:
> (p_dlm:2:stop:stderr) dlm_controld.pcmk: no process found
> May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint
> "ocfs2:controld": Object does not exist
> May 14 15:22:20 ocfs2_controld: last message repeated 199 times
> May 14 15:22:25 gw05 ocfs2_controld: Unable to open checkpoint
> "ocfs2:controld": Object does not exist
> May 14 15:22:26 ocfs2_controld: last message repeated 199 times
> May 14 15:22:31 gw05 ocfs2_controld: Unable to open checkpoint
> "ocfs2:controld": Object does not exist
> May 14 15:22:32 ocfs2_controld: last message repeated 199 times
> May 14 15:22:37 gw05 ocfs2_controld: Unable to open checkpoint
> "ocfs2:controld": Object does not exist
> May 14 15:22:38 ocfs2_controld: last message repeated 199 times
>
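
As an illustrative sketch, not something posted in the thread: the syslog excerpt above collapses bursts into "last message repeated N times" lines, so counting the visible lines understates how often the error was logged. The Python below expands those repeat lines to estimate the real count. The function name `count_occurrences` and the `sample` list are hypothetical, and the sample assumes each log entry sits on one line (as it would in the syslog file, before email wrapping).

```python
import re

# Matches syslog's "last message repeated N times" compression line.
REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def count_occurrences(lines, needle):
    """Count how often a message containing `needle` was logged,
    expanding syslog repeat lines that follow a matching message."""
    total = 0
    last_matched = False  # did the most recent real message contain needle?
    for line in lines:
        m = REPEAT_RE.search(line)
        if m:
            # A repeat line refers to the message logged just before it.
            if last_matched:
                total += int(m.group(1))
            continue
        last_matched = needle in line
        if last_matched:
            total += 1
    return total


# Hypothetical input: a few entries from the excerpt, unwrapped.
sample = [
    'May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist',
    'May 14 15:22:14 ocfs2_controld: last message repeated 199 times',
    'May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk',
    'May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist',
    'May 14 15:22:20 ocfs2_controld: last message repeated 199 times',
]

print(count_occurrences(sample, "Unable to open checkpoint"))  # prints 400
```

Tracking only the most recent real message keeps the sketch simple. It assumes a repeat line always refers to the entry immediately before it in the same file, which may not hold if several daemons interleave their output.
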
> One other interesting bit of log (well, to me) was this message, which
> appeared when I tried to manually mount the OCFS2 store on the afflicted
> server:
>
> mount.ocfs2: Unable to access cluster service while trying to join
> the group
>
> One other note: I discovered I had not specified a monitor for either
> the pacemaker:o2cb or the pacemaker:controld RA. Could that possibly
> have triggered this issue?
>
> --
>
> Sincerely,
> Matthew O'Connor
>