[Pacemaker] can we update an attribute with cmpxchg "atomic compare and exchange" semantics?
Andrew Beekhof
andrew at beekhof.net
Tue Sep 30 03:51:21 UTC 2014
On 30 Sep 2014, at 6:22 am, Lars Ellenberg <Lars.Ellenberg at linbit.com> wrote:
> On Wed, Sep 10, 2014 at 11:50:58AM +0200, Lars Ellenberg wrote:
>>
>> Hi Andrew (and others).
>>
>> For a certain use case (yes, I'm talking about DRBD "peer-fencing" on
>> loss of replication link), it would be nice to be able to say:
>>
>> update some_attribute=some_attribute+1 where some_attribute >= 0
>>
>> delete some_attribute where some_attribute=0
>>
>> Ok, that's not the classic cmpxchg(), more of an atomic_add();
>> or similar enough. With hopefully just a single cib roundrip.
>>
>>
>> Let me rephrase:
>> Update attribute "this_is_pink" (for node-X with ID attr-ID):
>>
>> fail if said attr-ID exists elsewhere (not as the intended attribute
>> at the intended place in the xml tree)
>> (this comes for free already, I think)
>>
>> if it does not exist at all, assume it was present with current value 0
>>
>> if the current (or assumed current) value is >= 0, add 1
>>
>> if the current value is < 0, fail
>>
>> (optionally: return new value? old value?)
>
> Did anyone read this?
Yep, but it requires a non-trivial answer so it got deferred :)
Its a reasonable request, we've spoken about something similar in the past and its clear that at some point attrd needs to grow some extra capabilities.
Exactly when it will bubble up to the top of the todo list is less certain, though I would happily coach someone with the necessary motivation.
The other thing to mention is that currently the only part that wont work is "if the current value is < 0, fail".
Setting value="value++" will do the rest.
So my question would be... how important is the 'lt 0' case?
Actually, come to think of it, it's not a bad default behaviour.
Certainly failing value++ if value=-INFINITY would be logically consistent with the existing code.
Would that be sufficient?
>
>> My intended use case scenario is this:
>>
>> Two DRBD nodes, several DRBD resources,
>> at least a few of them in "dual-primary".
>>
>> Replication link breaks.
>>
>> Fence-peer handlers are triggered individually for each resource on
>> both nodes, and try to concurrently modify the cib (place fencing
>> constraints).
>>
>> With the current implementation of crm-fence-peer.sh, it is likely that
>> some DRBD resources "win" on one node, some "win" on the other node.
>> The respective losers will have their IO blocked.
>>
>> Which means that most likely on both nodes some DRBD will stay blocked,
>> some monitor operation will soon fail, some stop operation (to recover
>> from the monitor fail) will soon fail, and the recovery of that will be
>> node-level fencing of the affected node.
>>
>> In short: both nodes will be hard-reset
>> because of a replication link failure.
>>
>>
>>
>> If I would instead use a single attribute (with a pre-determined ID) for all
>> instances of the fence-peer handler, the first to come would "chose" the
>> victim node, all others would just add their count.
>> There will be only one loser, and more importantly: one survivor.
>>
>> Once the replication link is re-established,
>> DRBD resynchronization will bring the former loser up-to-date,
>> and the respective after-resync handlers will decrease that "breakage
>> count". Once the breakage count hits zero, it can and should be deleted.
>>
>> Presence of the "breakage count" attribute with value > 0 would mean
>> "this node must not be promoted", which would be a static constraint
>> to be added to all DRBD resources.
>>
>> Does that make sense?
>>
>> (I have more insane proposals, in case we have multiple (more than 2)
>> Primaries during normal operation, but I'm not yet able to write them
>> down without being seriously confused by myself...)
>>
>>
>> I could open-code it with shell and cibadmin, btw.
>> I did a proof-of-concept once that does
>> a. cibadmin -Q
>> b. some calculations,
>> then prepares the update statement xml based on cib content seen,
>> *including* the cib generation counters
>> c. cibadmin -R (or -C, -M, -D, as appropriate)
>> this will fail if the cib was modified in a relevant way since a,
>> because of the included generation counters
>> d. repeat as necessary
>>
>>
>> But that is beyond ugly.
>> And probably fragile.
>> And would often fail for all the wrong reasons, just because some status
>> code has changed and bumped the cib generation counters.
>>
>> What would be needed to add such functionality?
>> Where would it go?
>> cibadmin? cib? crm_attribute? possibly also attrd?
>>
>> Thanks,
>> Lars
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20140930/eee5435d/attachment-0004.sig>
More information about the Pacemaker
mailing list