[ClusterLabs] attrd does not clean per-node cache after node removal
Vladislav Bogdanov
bubble at hoster-ok.com
Wed Mar 23 17:01:15 UTC 2016
23.03.2016 19:52, Vladislav Bogdanov wrote:
> 23.03.2016 19:39, Ken Gaillot wrote:
>> On 03/23/2016 07:35 AM, Vladislav Bogdanov wrote:
>>> Hi!
>>>
>>> It seems like the atomic attrd in post-1.1.14 (eb89393) does not
>>> fully clean the node cache after a node is removed.
>>
>> Is this a regression? Or have you only tried it with this version?
>
> Only with this one.
>
>>
>>> After our QA guys remove node wa-test-server-ha-03 from a two-node
>>> cluster:
>>> * stop pacemaker and corosync on wa-test-server-ha-03
>>> * remove node wa-test-server-ha-03 from corosync nodelist on
>>> wa-test-server-ha-04
>>> * tune votequorum settings
>>> * reload corosync on wa-test-server-ha-04
>>> * remove node from pacemaker on wa-test-server-ha-04
>>> * delete everything from /var/lib/pacemaker/cib on wa-test-server-ha-03
>>> , and then rejoin it with a different corosync ID (but with the same
>>> node name), we see the following in the logs (a command sketch of the
>>> procedure is inserted just below, before the log excerpts):
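The sequence above corresponds roughly to the commands below (only a sketch
from memory; the exact invocations our QA scripts use may differ, and
crm_node/corosync-cfgtool are just one way to do the per-step work):

  # on wa-test-server-ha-03 (the node being removed)
  service pacemaker stop
  service corosync stop
  rm -f /var/lib/pacemaker/cib/*

  # on wa-test-server-ha-04 (the remaining node)
  vi /etc/corosync/corosync.conf      # drop the old nodelist entry, adjust votequorum
  corosync-cfgtool -R                 # ask corosync to re-read corosync.conf
  crm_node --force --remove wa-test-server-ha-03

  # afterwards corosync.conf (now with nodeid 3 for wa-test-server-ha-03) is
  # copied to wa-test-server-ha-03 and corosync/pacemaker are started there again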
>>>
>>> Leave node 1 (wa-test-server-ha-03):
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice:
>>> crm_update_peer_proc: Node wa-test-server-ha-03[1] - state is now
>>> lost (was member)
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Removing
>>> all wa-test-server-ha-03 (1) attributes for attrd_peer_change_cb
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Lost
>>> attribute writer wa-test-server-ha-03
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Removing
>>> wa-test-server-ha-03/1 from the membership list
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Purged 1
>>> peers with id=1 and/or uname=wa-test-server-ha-03 from the membership
>>> cache
>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice:
>>> Processing peer-remove from wa-test-server-ha-04: wa-test-server-ha-03 0
>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Removing
>>> all wa-test-server-ha-03 (0) attributes for wa-test-server-ha-04
>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Removing
>>> wa-test-server-ha-03/1 from the membership list
>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Purged 1
>>> peers with id=0 and/or uname=wa-test-server-ha-03 from the membership
>>> cache
>>>
>>> Join node 3 (the same one, wa-test-server-ha-03, but ID differs):
>>> Mar 23 04:21:23 wa-test-server-ha-04 attrd[25962]: notice:
>>> crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now
>>> member (was (null))
>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: warning:
>>> crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 -
>>> a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: warning:
>>> crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 -
>>> 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: crit: Node 1
>>> and 3 share the same name 'wa-test-server-ha-03'
>>
>> It took me a while to understand the above combination of messages. This
>> is not node 3 joining. This is node 1 joining after node 3 has already
>> been seen.
>
> Hmmm...
> corosync.conf and corosync-cmapctl both say it is 3.
> Also, the CIB lists it as 3, and lrmd puts its status records under 3.
I mean:
<node_state id="3" uname="wa-test-server-ha-03" crmd="online" crm-debug-origin="do_update_resource" in_ccm="true" join="member" expected="member">
  <lrm id="3">
    <lrm_resources>
    ...
    </lrm_resources>
  </lrm>
</node_state>
<node_state id="1">
  <transient_attributes id="1">
    <instance_attributes id="status-1">
      <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
      <nvpair id="status-1-master-rabbitmq-local" name="master-rabbitmq-local" value="1"/>
      <nvpair id="status-1-master-meta-0-0-drbd" name="master-meta-0-0-drbd" value="10000"/>
      <nvpair id="status-1-master-staging-0-0-drbd" name="master-staging-0-0-drbd" value="10000"/>
      <nvpair id="status-1-rabbit-start-time" name="rabbit-start-time" value="1458732136"/>
    </instance_attributes>
  </transient_attributes>
</node_state>
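For completeness, the stale entry can be inspected and even dropped by hand,
e.g. with something like the commands below (just a sketch; presumably attrd
writes the id="1" entry back as long as its own cache still maps the old ID):

  # dump the whole status section
  cibadmin -Q -o status

  # delete the first node_state matching id="1"
  cibadmin -D -X '<node_state id="1"/>'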
>
> Actually, the issue is that the DRBD resources are not promoted because
> their master attributes go to the section with node-id 1. That is the only
> reason we noticed this at all. Everything not related to volatile
> attributes works fine.
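To be concrete: the DRBD agent expresses its promotion preference as a
transient node attribute via crm_master, roughly like this (only a sketch,
with the attribute name taken from the CIB dump above):

  # inside the resource agent, on each monitor of the peer to be promoted
  crm_master -l reboot -v 10000

and the score can be checked per node with something like

  crm_attribute -G -l reboot -N wa-test-server-ha-03 -n master-meta-0-0-drbd

In our case the value ends up under node_state id="1", so the policy engine
never sees a master score for the node with id="3" and never promotes there.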
>
>>
>> The warnings are a complete dump of the peer cache. So you can see that
>> wa-test-server-ha-03 is listed only once, with id 3.
>>
>> The critical message ("Node 1 and 3") lists the new ID first and the
>> found ID second. So ID 1 is what it's trying to add to the cache.
>
> But there is also 'Node 'wa-test-server-ha-03' has changed its ID from 1
> to 3' - it goes first. Does that matter?
>
>>
>> Did you update the node ID in corosync.conf on *both* nodes?
>
> Sure.
> It is automatically copied to the node being joined.
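I.e. after the change both nodes end up with a nodelist along these lines
(a sketch only; ring addresses simplified, IDs as in the logs above):

  nodelist {
      node {
          ring0_addr: wa-test-server-ha-04
          nodeid: 2
      }
      node {
          ring0_addr: wa-test-server-ha-03
          nodeid: 3
      }
  }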
>
>>
>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: notice: Node
>>> 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: warning:
>>> crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 -
>>> a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: warning:
>>> crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 -
>>> 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: crit: Node 1
>>> and 3 share the same name 'wa-test-server-ha-03'
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: notice: Node
>>> 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: warning:
>>> crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 -
>>> a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: warning:
>>> crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 -
>>> 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: crit: Node 1
>>> and 3 share the same name 'wa-test-server-ha-03'
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: notice: Node
>>> 'wa-test-server-ha-03' has changed its ID from 3 to 1
>>> ...
>>>
>>> On the node being joined:
>>> Mar 23 04:21:23 wa-test-server-ha-03 attrd[15260]: notice:
>>> Connecting to cluster infrastructure: corosync
>>> Mar 23 04:21:23 wa-test-server-ha-03 attrd[15260]: notice:
>>> crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now
>>> member (was (null))
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice:
>>> crm_update_peer_proc: Node wa-test-server-ha-04[2] - state is now
>>> member (was (null))
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice: Recorded
>>> attribute writer: wa-test-server-ha-04
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice:
>>> Processing sync-response from wa-test-server-ha-04
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: warning:
>>> crm_find_peer: Node 2/wa-test-server-ha-04 = 0xdfe620 -
>>> ad08ca96-295a-4fa4-99f9-8c8a2d0b6ac0
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: warning:
>>> crm_find_peer: Node 3/wa-test-server-ha-03 = 0xd7ae20 -
>>> f85bdc4b-a3ee-47ff-bdd5-7c1dcf9fe97c
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: crit: Node 1
>>> and 3 share the same name 'wa-test-server-ha-03'
>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Node
>>> 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Updating
>>> all attributes after cib_refresh_notify event
>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Updating
>>> all attributes after cib_refresh_notify event
>>>
>>>
>>> After that, the CIB status section contains entries for three nodes, with
>>> IDs 1, 2 and 3.
>>> For node 2 (the one that remained) there are both transient attributes
>>> and lrm status entries.
>>> For node 1 (the one that was removed) there are only transient attributes.
>>> For node 3 (the newly joined one) there are only lrm status entries.
>>>
>>> That makes me think that not everything (stale caches?) is removed from
>>> attrd when a node leaves.
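A quick way to see attrd's own view, by the way, is to query one of the
affected attributes on all nodes, e.g. (a sketch; assumes attrd_updater has
the --query/--all support that comes with the rewritten attrd):

  attrd_updater -Q -A -n master-meta-0-0-drbd

which should print the value together with the host attrd associates it with.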
>>>
>>>
>>> Is there any other information I can provide to help resolve this issue?
>>>
>>> Best,
>>> Vladislav