[Pacemaker] crm_resource -L not trustable right after restart
Andrew Beekhof
andrew at beekhof.net
Wed Jan 15 21:35:28 UTC 2014
On 16 Jan 2014, at 6:53 am, Brian J. Murrell (brian) <brian at interlinx.bc.ca> wrote:
> On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
>>
>> Consider any long running action, such as starting a database.
>> We do not update the CIB until after actions have completed, so there can and will be times when the status section is out of date to one degree or another.
>
> But that is the opposite of what I am reporting
I know; I was giving you another example of when the CIB is not completely up to date with reality.
> and is acceptable. It's
> acceptable for a resource that is in the process of starting being
> reported as stopped, because it's not yet started.
It may very well be partially started. It's almost certainly not stopped, which is what is being reported.
>
> What I am seeing is resources being reported as stopped when they are in
> fact started/running and have been for a long time.
>
>> At node startup is another point at which the status could potentially be behind.
>
> Right. Which is the case I am talking about.
>
>> It sounds to me like you're trying to second guess the cluster, which is a dangerous path.
>
> No, not trying to second guess at all.
You're not using the output to decide whether to perform some logic?
Because crm_mon is the more usual command to run right after startup (which would give you enough context to know things are still syncing).
> I'm just trying to ask the
> cluster what the state is and not getting the truth. I am willing to
> believe whatever state the cluster says it's in as long as what I am
> getting is the truth.
>
>> What if it's the first node to start up?
>
> I'd think a timeout comes into play here.
>
>> There'd be no fresh copy to arrive in that case.
>
> I can't say that I know how the CIB works internally/entirely, but I'd
> imagine that when a cluster node starts up it tries to see if there is a
> more fresh CIB out there in the cluster.
Nope.
> Maybe this is part of the
> process of choosing/discovering a DC.
DC election happens at the crmd. The CIB is a dumb repository of name/value pairs.
It doesn't even understand new vs. old - only different.
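
To make the "only different" point concrete, here is a toy sketch. It is not Pacemaker's CIB code or API; the `diff_stores` helper, the resource name and the state strings are all hypothetical, chosen only for illustration. It shows that comparing two plain name/value stores can tell you which entries differ, but gives no basis for deciding which copy is the fresher one.

```python
# Toy model only: this is NOT how Pacemaker's cib is implemented.
# It illustrates a name/value store that can detect "different"
# but has no notion of "newer" or "older".

def diff_stores(local, remote):
    """Return the sorted keys whose values differ between two stores."""
    keys = set(local) | set(remote)
    return sorted(k for k in keys if local.get(k) != remote.get(k))

# Hypothetical resource name and states, for illustration only.
node_a = {"db_resource": "Stopped"}
node_b = {"db_resource": "Started"}

changed = diff_stores(node_a, node_b)
# The two copies disagree about db_resource, but nothing in either
# store says which value reflects reality more recently.
print(changed)
```

The comparison is symmetric: swapping the arguments gives the same result, which is exactly why a store like this cannot, on its own, tell that its copy is stale.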
> But ultimately if the node is the
> first one up, it will eventually figure that out so that it can nominate
> itself as the DC. Or it finds out that there is a DC already (and gets
> a fresh CIB from it?). It's during that window that I propose that
> crm_resource should not be asserting anything and should just admit that
> it does not (yet) know.
>
>> If it had enough information to know it was out of date, it wouldn't be out of date.
>
> But surely it understands if it is in the process of joining a cluster
> or not, and therefore does know enough to know that it doesn't know if
> it's out of date or not.
And if it has a newer config compared to the existing nodes?
> But that it could be.
>
>> As above, there are situations when you'd never get an answer.
>
> I should have added to my proposal "or has determined that there is
> nothing to refresh its CIB from" and that its local copy is
> authoritative for the whole cluster.
>
> b.