[Pacemaker] CIB write-to-disk bug?
Alan Robertson
alanr at unix.sh
Thu Apr 1 06:12:47 UTC 2010
OK....
Since there was no ssh-as-root between the cluster nodes, I didn't send
all the logs along from every node in the cluster - and it didn't occur
to me to look at all of them.
However, the problem has gotten curiouser and curiouser - because ALL the
nodes in the cluster reported the same problem at the same time...
That makes it a lot less likely to be a race condition with the disk
writing infrastructure...
I've attached the relevant lines from the various machines - slightly
processed (date stamp format changed and a few other minor things).
Let me know if you want me to send all the system logs along...
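For reference, the kind of race suspected in the original report can be sketched in a few lines of Python. This is only an illustration, not the CIB code: the function names, the file name, and the choice of MD5 (picked because the logged digests are 32 hex digits long) are all assumptions. It shows how a digest taken for one version of the data fails against a newer version on disk even though the file itself is intact, and how taking the digest and the file contents from the same buffer avoids that.

```python
import hashlib
import os
import tempfile


def digest(data: bytes) -> str:
    # MD5 is an assumption, chosen only because the logged digests
    # are 32 hex digits long.
    return hashlib.md5(data).hexdigest()


def write_atomically(path: str, data: bytes) -> None:
    # Write to a temporary file in the same directory, force it to disk,
    # then rename it over the target so readers never see a partial file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def racy_check(path: str, v1: bytes, v2: bytes) -> bool:
    # The expected digest is taken for version 1, but version 2 reaches
    # the disk before the comparison, so a perfectly valid file fails.
    expected = digest(v1)
    write_atomically(path, v2)
    with open(path, "rb") as f:
        return digest(f.read()) == expected


def consistent_check(path: str, data: bytes) -> bool:
    # The expected digest and the file contents come from the same buffer.
    expected = digest(data)
    write_atomically(path, data)
    with open(path, "rb") as f:
        return digest(f.read()) == expected
```

With two different versions, racy_check() returns False while consistent_check() returns True, which is the "intact file, failed checksum" pattern described in the report.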
Alan Robertson wrote:
> Hi,
>
> I've run into what looks at first blush to be a CIB bug in writing to disk.
>
> The key messages from this incident are these:
>
>
> Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest:
> Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
> (/var/lib/heartbeat/crm/cib.GUdD9T), calculated
> 0bac3440f5c42f0f37d22ea7dfe433e8
> Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Checksum of
> /var/lib/heartbeat/crm/cib.uHFtAW failed! Configuration contents ignored!
> Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Usually this
> is caused by manual changes, please refer to
> http://clusterlabs.org/wiki/FAQ#cib_changes_detected
> Mar 31 19:02:52 vhost0384 cib: [13294]: WARN: retrieveCib: Continuing
> but /var/lib/heartbeat/crm/cib.uHFtAW will NOT used.
>
>
> I did not make manual changes on a running CIB. I was using the cluster
> shell at the time. The CIB it is complaining about appears to be an
> intact, valid CIB with contents approximately like they should have been
> at the time. By the way, I have a report from another IBMer that they
> have seen systems that stop writing to their local CIBs. I'll contact him.
>
> Here are some relevant facts:
>   - These machines are virtual guests in a cloud somewhere - operations
>     have somewhat unpredictable latency. But, nothing too egregious
>     was happening at the time or Heartbeat would have bitched.
>   - I was doing some testing at the time. I was putting on and
>     taking off constraints using the cluster shell
>     migrate and unmigrate operations.
>
> Given that the file looks intact, and I know how the CIB is written to
> disk (since I originally wrote that code), I wonder if it isn't a
> versioning issue / race condition. That is, the code for writing to
> disk does NOT guarantee when it gets done (assuming you're still using
> it). It would be easy to do a checksum on the wrong version compared to
> the version you thought it should be (or before it completed).
>
> Andrew: You should have already received all the relevant logs in a
> separate email.
>
> Also, for my reference - what method are you using to compute the digest
> of the file? That is, what command should I execute to get the same
> results?
>
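Pending an answer, one thing to try is a plain MD5 over the raw bytes of the file, sketched below in Python. Both parts of that are guesses: the logged digests are 32 hex digits, which is the right length for MD5, but nothing in the messages above says which algorithm is used, and the CIB code may well digest some in-memory form of the XML rather than the bytes on disk. The function name file_md5 is made up for this example.

```python
import hashlib


def file_md5(path: str) -> str:
    # Hash the raw bytes of the file in chunks, so large files
    # do not have to be read into memory at once.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling file_md5() on the file named in the error message and comparing the result against the "expected" and "calculated" values in the log would at least show whether the raw-bytes guess is right. If it matches neither, the digest is probably taken over something other than the file as stored.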
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: log.excerpt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100401/b63bbe66/attachment-0002.ksh>