[Pacemaker] Patches: RFC before pull request

Thu Jan 8 00:27:13 EST 2015

They all look sane to me.  Please proceed with a pull request :-)

We should probably start thinking about .13 (or .14 for the superstitious); quite a few important patches have arrived since .12 was released.

> On 10 Dec 2014, at 1:33 am, Lars Ellenberg <Lars.Ellenberg at linbit.com> wrote:
> 
> 
> Andrew,
> All,
> 
> Please have a look at the patches I queued up here:
> https://github.com/lge/pacemaker/commits/for-beekhof
> 
> Most (not all) are specific to the heartbeat cluster stack.
> 
> Thanks,
> 	Lars
> 
> A few comments here:
> 
> -----
> 
> This effectively changes crm_mon output,
> but also changes logging where this method is invoked:
> 
>    Low: native_print: report target-role as well
> 
>    This is for the "Why does my resource not start?" guys who
>    forgot to remove the limiting target-role setting.
> 
>    Report target role (unless "Started", which is the default anyways),
>    if it limits our abilities (Slave, Stopped),
>    or if it differs from the current status.
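
The reporting rule in that commit message can be read as a small decision function. The following is only a sketch of that reading, not the actual native_print() code; the enum and the function name are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Hypothetical role enum for illustration; not Pacemaker's actual type. */
enum role { ROLE_STOPPED, ROLE_STARTED, ROLE_SLAVE, ROLE_MASTER };

/* Return true if the target role is worth printing next to the resource. */
bool should_report_target_role(enum role target, enum role current)
{
    if (target == ROLE_STARTED) {
        return false;           /* the default, so never reported */
    }
    if (target == ROLE_SLAVE || target == ROLE_STOPPED) {
        return true;            /* limits what the resource may do */
    }
    return target != current;   /* e.g. target Master while still Slave */
}
```

Under this reading, a resource with target-role Stopped is always flagged, which is the case the "Why does my resource not start?" question is about.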
> 
> -----
> 
> Heartbeat specific:
> 
>    Low: allow heartbeat to spawn the pengine itself, and tell crmd about it
> 
>    Heartbeat 3.0.6 now may spawn the pengine directly, and will announce
>    this in the environment -- I introduced the setting "crmd_spawns_pengine".
> 
>    This improves shutdown behavior.  Otherwise I regularly find an orphaned
>    pengine process after pacemaker shutdown.
> 
> -----
> 
> Heartbeat specific, as a consequence of the fix below:
> 
>    Low: add debugging aid to help spot missing set_msg_callback()s on heartbeat
> 
>    In ha_msg_dispatch(), change from rcvmsg() to readmsg().
>    rcvmsg() is internally just a wrapper around readmsg()
>    that silently deletes messages without a matching callback.
> 
>    Use readmsg() directly here. It will only return unprocessed (by
>    callbacks) messages, so log a warning, notice or debug message
>    depending on message header information, and ha_msg_del() it ourselves.
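
The idea of surfacing messages that no callback claimed can be sketched without the heartbeat library. Everything below (struct msg, struct handler, dispatch(), drain(), count_handled()) is a hypothetical stand-in and not the heartbeat client API; the type strings are just labels borrowed from this thread, not the real F_TYPE values.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the heartbeat API. */
struct msg { const char *type; };
typedef void (*msg_cb)(const struct msg *m);
struct handler { const char *type; msg_cb cb; };

int handled_count;

/* Example callback: just counts the messages it was given. */
void count_handled(const struct msg *m)
{
    (void)m;
    handled_count++;
}

/* Deliver m to the matching callback. Returns 1 if handled, 0 if not. */
int dispatch(const struct handler *table, size_t n, const struct msg *m)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].type, m->type) == 0) {
            table[i].cb(m);
            return 1;
        }
    }
    return 0;
}

/* Process all messages; warn about, and count, those nobody claimed. */
size_t drain(const struct handler *table, size_t n,
             const struct msg *msgs, size_t n_msgs)
{
    size_t unhandled = 0;

    for (size_t i = 0; i < n_msgs; i++) {
        if (!dispatch(table, n, &msgs[i])) {
            fprintf(stderr, "warning: no callback for message type '%s'\n",
                    msgs[i].type);
            unhandled++;
        }
    }
    return unhandled;
}
```

With only a handler for "T_STONITH_NOTIFY" in the table, a message labelled "T_STONITH_TIMEOUT_VALUE" produces a warning instead of vanishing, which is the kind of hint that would point at a missing callback registration.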
> 
> -----
> 
> Heartbeat specific bug fix:
> 
>    High: fix stonith ignoring its own messages on heartbeat
> 
>    Since the introduction of the additional F_TYPE messages
>    T_STONITH_NOTIFY and T_STONITH_TIMEOUT_VALUE, and their use as message
>    types in global heartbeat cluster messages, stonith-ng was broken on the
>    heartbeat cluster stack.
> 
>    When delegation was made the default, and the result could only be
>    reaped by listening for the T_STONITH_NOTIFY message, no-one (but
>    stonithd itself) would ever notice successful completion,
>    and stonith would be re-issued forever.
> 
>    Registering callbacks for these F_TYPE fixes these hung stonith and
>    stonith_admin operations on the heartbeat cluster stack.
> 
> -----
> 
> Heartbeat specific:
> 
>    Medium: fix tracking of peer client process status on heartbeat
> 
>    Don't optimistically assume that peer client processes are alive,
>    or that a node that can talk to us is in fact a member of the same
>    ccm partition.
> 
>    Whenever ccm tells us about a new membership, *ask* for peer client
>    process status.
> 
> -----
> 
> This oneliner may well be relevant for corosync CPG as well,
> possibly one of the reasons the pcmk_cpg_membership() has this funny
> "appears to be online even though we think it is dead" block?
> 
>    fix crm_update_peer_proc to NOT ignore flags if partially set
> 
>    The "set_bit()" function used here actually deals with masks, not bit numbers.
>    The "flag" argument should in fact be plural: flags.
> 
>    These proc flag bits are not always set one at a time,
>    but for example as "crm_proc_crmd | crm_proc_cpg",
>    and not necessarily cleared with the same combination.
> 
>    Ignoring to-be-set flags just because *some* of the flag bits are
>    already set is clearly a bug, and may be the reason for stale process
>    cache information.
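
The difference between "some bits already set" and "all bits already set" is easy to show in isolation. This is a minimal sketch with hypothetical flag values and function names; the buggy variant is a guess at the shape of the problem described above, not a copy of crm_update_peer_proc().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration only. */
#define PROC_CRMD 0x1u
#define PROC_CPG  0x2u

/* Buggy pattern: skips the update if ANY requested bit is already set. */
bool set_flags_buggy(uint32_t *procs, uint32_t flags)
{
    if (*procs & flags) {
        return false;           /* wrongly treated as "nothing to do" */
    }
    *procs |= flags;
    return true;
}

/* Fixed pattern: skips the update only if ALL requested bits are set. */
bool set_flags_fixed(uint32_t *procs, uint32_t flags)
{
    if ((*procs & flags) == flags) {
        return false;           /* genuinely nothing to change */
    }
    *procs |= flags;
    return true;
}
```

Starting from a cache that holds only PROC_CRMD, a request to set "PROC_CRMD | PROC_CPG" is dropped by the buggy variant, so PROC_CPG never makes it into the cache; the fixed variant sets it.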
> 
> -----
> 
> Heartbeat specific:
> 
>    Medium: map heartbeat JOIN/LEAVE status to ONLINE/OFFLINE
> 
>    The rest of the code deals in "online" and "offline",
>    not "join" and "leave". Need to map these states,
>    or the rest of the code won't work properly.
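
Such a mapping could look roughly like the sketch below. The function name is hypothetical, the strings are the ones quoted in the commit message, and passing any other status through unchanged is an assumption made here, not something the patch description states.

```c
#include <string.h>

/* Map heartbeat-style status strings to the ones the rest of the code
 * expects. Unknown values are passed through unchanged (an assumption). */
const char *map_hb_status(const char *hb_status)
{
    if (hb_status == NULL) {
        return NULL;
    }
    if (strcmp(hb_status, "join") == 0) {
        return "online";
    }
    if (strcmp(hb_status, "leave") == 0) {
        return "offline";
    }
    return hb_status;
}
```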
> 
> -----
> 
> Generic: if shutdown is requested before the stonith connection was ever
> established (due to other problems), insisting on retrying the stonith
> connection confused the shutdown.
> 
>    Medium: don't trigger a stonith_reconnect if no longer required
> 
>    Get rid of some spurious error messages, and speed up shutdown,
>    even if the connection to the stonith daemon failed.
> 
> -----
> 
> Non-functional change, just for readability:
> 
>    Low: use CRM_NODE_MEMBER, not CRM_NODE_ACTIVE
> 
>    ACTIVE is defined to be MEMBER anyways:
>    include/crm/cluster.h:#define CRM_NODE_ACTIVE    CRM_NODE_MEMBER
> 
>    Don't confuse the reader of the code
>    by implying it was something different.
> 
> -----
> 
> Heartbeat specific, packaging only:
> 
>    Low: heartbeat 3.0.6 knows how to find the daemons; drop compat symlinks
> 
> 