[ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
Andrew Beekhof
andrew at beekhof.net
Wed Aug 5 17:55:00 EDT 2015
Ok, I’ll look into it. Thanks for retesting.
> On 5 Aug 2015, at 4:00 pm, renayama19661014 at ybb.ne.jp wrote:
>
> Hi Andrew,
>
>>> Do you know if this behaviour still exists?
>>> A LOT of work went into the remote node logic in the last couple of months,
>>> it's possible this was fixed as a side-effect.
>>
>>
>> I have not confirmed it with the latest code yet.
>> I will confirm it.
>
>
> I confirmed it with the latest Pacemaker (pacemaker-eefdc909a41b571dc2e155f7b14b5ef0368f2de7).
>
> The phenomenon still occurs after all.
>
>
> On the first cleanup, Pacemaker fails to connect to pacemaker_remote.
> The second cleanup succeeds.
>
> The problem does not seem to be fixed yet.
>
>
>
> I added my debug log to the latest code again.
>
> -------
> (snip)
> static size_t
> crm_remote_recv_once(crm_remote_t * remote)
> {
>     int rc = 0;
>     size_t read_len = sizeof(struct crm_remote_header_v0);
>     struct crm_remote_header_v0 *header = crm_remote_header(remote);
>
>     if(header) {
>         /* Stop at the end of the current message */
>         read_len = header->size_total;
>     }
>
>     /* automatically grow the buffer when needed */
>     if(remote->buffer_size < read_len) {
>         remote->buffer_size = 2 * read_len;
>         crm_trace("Expanding buffer to %u bytes", remote->buffer_size);
>
>         remote->buffer = realloc_safe(remote->buffer, remote->buffer_size + 1);
>         CRM_ASSERT(remote->buffer != NULL);
>     }
>
> #ifdef HAVE_GNUTLS_GNUTLS_H
>     if (remote->tls_session) {
>         if (remote->buffer == NULL) {
>             crm_info("### YAMAUCHI buffer is NULL [buffer_zie[%d] readlen[%d]",
>                      remote->buffer_size, read_len);
>         }
>         rc = gnutls_record_recv(*(remote->tls_session),
>                                 remote->buffer + remote->buffer_offset,
>                                 remote->buffer_size - remote->buffer_offset);
> (snip)
> -------
>
> When Pacemaker fails on the first connection to the remote node, my log message is printed.
> It is not printed on the second connection.
>
> [root at sl7-01 ~]# tail -f /var/log/messages | grep YAMA
> Aug 5 14:46:25 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
> Aug 5 14:46:26 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
> Aug 5 14:46:28 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
> Aug 5 14:46:30 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
> Aug 5 14:46:31 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
> (snip)
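>
> Looking at the log above, buffer_size (1326) is already larger than read_len (40),
> so the "grow the buffer" branch is skipped and remote->buffer stays NULL; the
> CRM_ASSERT() in that branch never even gets evaluated. As an illustration only
> (not a confirmed upstream fix), one possible guard, using the same fields and
> helpers as the snippet above, would also allocate whenever the buffer itself is
> missing:
>
> -------
> /* Sketch only: allocate whenever the buffer is missing, not just when it is
>  * too small, so gnutls_record_recv() is never handed a NULL pointer. */
> if (remote->buffer == NULL || remote->buffer_size < read_len) {
>     if (remote->buffer_size < read_len) {
>         remote->buffer_size = 2 * read_len;
>     }
>     crm_trace("Expanding buffer to %u bytes", remote->buffer_size);
>     remote->buffer = realloc_safe(remote->buffer, remote->buffer_size + 1);
>     CRM_ASSERT(remote->buffer != NULL);
> }
> -------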
>
> Best Regards,
> Hideo Yamauchi.
>
>
>
>
> ----- Original Message -----
>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>> Cc:
>> Date: 2015/8/4, Tue 18:40
>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
>>
>> Hi Andrew,
>>
>>> Do you know if this behaviour still exists?
>>> A LOT of work went into the remote node logic in the last couple of months,
>>> it's possible this was fixed as a side-effect.
>>
>>
>> I have not confirmed it with the latest code yet.
>> I will confirm it.
>>
>> Many Thanks!
>> Hideo Yamauchi.
>>
>>
>> ----- Original Message -----
>>> From: Andrew Beekhof <andrew at beekhof.net>
>>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>> Cc:
>>> Date: 2015/8/4, Tue 13:16
>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
>>>
>>>
>>>> On 12 May 2015, at 12:12 pm, renayama19661014 at ybb.ne.jp wrote:
>>>>
>>>> Hi All,
>>>>
>>>> The problem seems to be that the buffer becomes NULL after running crm_resource -C, once the remote node has been rebooted.
>>>>
>>>> I added a log message to the source code and confirmed it.
>>>>
>>>> ------------------------------------------------
>>>> crm_remote_recv_once(crm_remote_t * remote)
>>>> {
>>>> (snip)
>>>>     /* automatically grow the buffer when needed */
>>>>     if(remote->buffer_size < read_len) {
>>>>         remote->buffer_size = 2 * read_len;
>>>>         crm_trace("Expanding buffer to %u bytes", remote->buffer_size);
>>>>
>>>>         remote->buffer = realloc_safe(remote->buffer, remote->buffer_size + 1);
>>>>         CRM_ASSERT(remote->buffer != NULL);
>>>>     }
>>>>
>>>> #ifdef HAVE_GNUTLS_GNUTLS_H
>>>>     if (remote->tls_session) {
>>>>         if (remote->buffer == NULL) {
>>>>             crm_info("### YAMAUCHI buffer is NULL [buffer_zie[%d] readlen[%d]",
>>>>                      remote->buffer_size, read_len);
>>>>         }
>>>>         rc = gnutls_record_recv(*(remote->tls_session),
>>>>                                 remote->buffer + remote->buffer_offset,
>>>>                                 remote->buffer_size - remote->buffer_offset);
>>>> (snip)
>>>> ------------------------------------------------
>>>>
>>>> May 12 10:54:01 sl7-01 crmd[30447]: info: crm_remote_recv_once: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
>>>> May 12 10:54:02 sl7-01 crmd[30447]: info: crm_remote_recv_once: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
>>>> May 12 10:54:04 sl7-01 crmd[30447]: info: crm_remote_recv_once: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
>>>
>>> Do you know if this behaviour still exists?
>>> A LOT of work went into the remote node logic in the last couple of months,
>>> it's possible this was fixed as a side-effect.
>>>
>>>>
>>>> ------------------------------------------------
>>>>
>>>> gnutls_record_recv is then called with the NULL buffer and returns an error.
>>>>
>>>> ------------------------------------------------
>>>> (snip)
>>>> ssize_t
>>>> _gnutls_recv_int(gnutls_session_t session, content_type_t type,
>>>>                  gnutls_handshake_description_t htype,
>>>>                  gnutls_packet_t *packet,
>>>>                  uint8_t * data, size_t data_size, void *seq,
>>>>                  unsigned int ms)
>>>> {
>>>>     int ret;
>>>>
>>>>     if (packet == NULL && (type != GNUTLS_ALERT && type != GNUTLS_HEARTBEAT)
>>>>         && (data_size == 0 || data == NULL))
>>>>         return gnutls_assert_val(GNUTLS_E_INVALID_REQUEST);
>>>>
>>>> (snip)
>>>> ssize_t
>>>> gnutls_record_recv(gnutls_session_t session, void *data, size_t data_size)
>>>> {
>>>>     return _gnutls_recv_int(session, GNUTLS_APPLICATION_DATA, -1, NULL,
>>>>                             data, data_size, NULL,
>>>>                             session->internals.record_timeout_ms);
>>>> }
>>>> (snip)
>>>> ------------------------------------------------
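>>>>
>>>> In other words, reading the code above (and assuming buffer_offset is 0 here),
>>>> when remote->buffer is NULL the receive call effectively becomes the following
>>>> and hits the GNUTLS_E_INVALID_REQUEST check in _gnutls_recv_int(). This is an
>>>> illustration only, with the values taken from the snippets and logs above:
>>>>
>>>> ------------------------------------------------
>>>> /* remote->buffer == NULL, remote->buffer_offset == 0 (assumed),
>>>>  * remote->buffer_size == 1326 as in the log above */
>>>> rc = gnutls_record_recv(*(remote->tls_session),
>>>>                         NULL,     /* data == NULL  */
>>>>                         1326);    /* data_size > 0 */
>>>> /* gnutls_record_recv() passes packet == NULL and
>>>>  * type == GNUTLS_APPLICATION_DATA, so _gnutls_recv_int() returns
>>>>  * GNUTLS_E_INVALID_REQUEST (-50), which matches the earlier debug message
>>>>  * "TLS receive failed: The request is invalid." that is treated as fatal. */
>>>> ------------------------------------------------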
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>>> To: "users at clusterlabs.org" <users at clusterlabs.org>
>>>>> Cc:
>>>>> Date: 2015/5/11, Mon 16:45
>>>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
>>>>>
>>>>> Hi Ulrich,
>>>>>
>>>>> Thank you for comments.
>>>>>
>>>>>> So your host and your resource are both named "snmp1"? I also don't
>>>>>> have much experience with cleaning up resources for a node that is offline.
>>>>>> What change should it make (while the node is offline)?
>>>>>
>>>>>
>>>>> The name of the remote resource and the name of the remote node are both "snmp1".
>>>>>
>>>>>
>>>>> (snip)
>>>>> primitive snmp1 ocf:pacemaker:remote \
>>>>>         params \
>>>>>                 server="snmp1" \
>>>>>         op start interval="0s" timeout="60s" on-fail="ignore" \
>>>>>         op monitor interval="3s" timeout="15s" \
>>>>>         op stop interval="0s" timeout="60s" on-fail="ignore"
>>>>>
>>>>> primitive Host-rsc1 ocf:heartbeat:Dummy \
>>>>>         op start interval="0s" timeout="60s" on-fail="restart" \
>>>>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>         op stop interval="0s" timeout="60s" on-fail="ignore"
>>>>>
>>>>> primitive Remote-rsc1 ocf:heartbeat:Dummy \
>>>>>         op start interval="0s" timeout="60s" on-fail="restart" \
>>>>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>         op stop interval="0s" timeout="60s" on-fail="ignore"
>>>>>
>>>>> location loc1 Remote-rsc1 \
>>>>>         rule 200: #uname eq snmp1
>>>>> location loc3 Host-rsc1 \
>>>>>         rule 200: #uname eq bl460g8n1
>>>>> (snip)
>>>>>
>>>>> I stop pacemaker_remoted on the snmp1 node with SIGTERM.
>>>>> Afterwards, I restart pacemaker_remoted on the snmp1 node.
>>>>> Then I execute the crm_resource command, but the snmp1 node remains offline.
>>>>>
>>>>> I think the correct behaviour after executing the crm_resource command
>>>>> would be for the snmp1 node to come back online.
>>>>>
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>>>> To: users at clusterlabs.org; renayama19661014 at ybb.ne.jp
>>>>>> Cc:
>>>>>> Date: 2015/5/11, Mon 15:39
>>>>>> Subject: Antw: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>>>>>>
>>>>>>>>> <renayama19661014 at ybb.ne.jp> wrote on 11.05.2015 at 06:22 in message
>>>>>>>>> <361916.15877.qm at web200006.mail.kks.yahoo.co.jp>:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I matched the OS version of the remote node with the host once again and
>>>>>>> confirmed it with Pacemaker 1.1.13-rc2.
>>>>>>>
>>>>>>> It was the same even when I made the host RHEL7.1 (bl460g8n1).
>>>>>>> I made the remote host RHEL7.1 (snmp1).
>>>>>>>
>>>>>>> The first crm_resource -C fails.
>>>>>>> --------------------------------
>>>>>>> [root at bl460g8n1 ~]# crm_resource -C -r snmp1
>>>>>>> Cleaning up snmp1 on bl460g8n1
>>>>>>> Waiting for 1 replies from the CRMd. OK
>>>>>>>
>>>>>>> [root at bl460g8n1 ~]# crm_mon -1 -Af
>>>>>>> Last updated: Mon May 11 12:44:31 2015
>>>>>>> Last change: Mon May 11 12:43:30 2015
>>>>>>> Stack: corosync
>>>>>>> Current DC: bl460g8n1 - partition WITHOUT quorum
>>>>>>> Version: 1.1.12-7a2e3ae
>>>>>>> 2 Nodes configured
>>>>>>> 3 Resources configured
>>>>>>>
>>>>>>>
>>>>>>> Online: [ bl460g8n1 ]
>>>>>>> RemoteOFFLINE: [ snmp1 ]
>>>>>>
>>>>>> So your host and you resource are both named
>> "snmp1"? I
>>> also
>>>>> don't
>>>>>> have much experience with cleaning up resources for a node
>> that is
>>> offline.
>>>>> What
>>>>>> change should it make (while the node is offline)?
>>>>>>
>>>>>>>
>>>>>>> Host-rsc1 (ocf::heartbeat:Dummy): Started bl460g8n1
>>>>>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started bl460g8n1 (failure ignored)
>>>>>>>
>>>>>>> Node Attributes:
>>>>>>> * Node bl460g8n1:
>>>>>>> + ringnumber_0 : 192.168.101.21 is UP
>>>>>>> + ringnumber_1 : 192.168.102.21 is UP
>>>>>>>
>>>>>>> Migration summary:
>>>>>>> * Node bl460g8n1:
>>>>>>>    snmp1: migration-threshold=1 fail-count=1000000 last-failure='Mon May 11 12:44:28 2015'
>>>>>>>
>>>>>>> Failed actions:
>>>>>>>     snmp1_start_0 on bl460g8n1 'unknown error' (1): call=5, status=Timed Out,
>>>>>>>     exit-reason='none', last-rc-change='Mon May 11 12:43:31 2015', queued=0ms, exec=0ms
>>>>>>> --------------------------------
>>>>>>>
>>>>>>>
>>>>>>> The second crm_resource -C succeeded and connected to the remote host.
>>>>>>
>>>>>> Then the node was online, it seems.
>>>>>>
>>>>>> Regards,
>>>>>> Ulrich
>>>>>>
>>>>>>> --------------------------------
>>>>>>> [root at bl460g8n1 ~]# crm_mon -1 -Af
>>>>>>> Last updated: Mon May 11 12:44:54 2015
>>>>>>> Last change: Mon May 11 12:44:48 2015
>>>>>>> Stack: corosync
>>>>>>> Current DC: bl460g8n1 - partition WITHOUT quorum
>>>>>>> Version: 1.1.12-7a2e3ae
>>>>>>> 2 Nodes configured
>>>>>>> 3 Resources configured
>>>>>>>
>>>>>>>
>>>>>>> Online: [ bl460g8n1 ]
>>>>>>> RemoteOnline: [ snmp1 ]
>>>>>>>
>>>>>>> Host-rsc1 (ocf::heartbeat:Dummy): Started bl460g8n1
>>>>>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>>>>>>> snmp1 (ocf::pacemaker:remote): Started bl460g8n1
>>>>>>>
>>>>>>> Node Attributes:
>>>>>>> * Node bl460g8n1:
>>>>>>> + ringnumber_0 : 192.168.101.21 is UP
>>>>>>> + ringnumber_1 : 192.168.102.21 is UP
>>>>>>> * Node snmp1:
>>>>>>>
>>>>>>> Migration summary:
>>>>>>> * Node bl460g8n1:
>>>>>>> * Node snmp1:
>>>>>>> --------------------------------
>>>>>>>
>>>>>>> The gnutls on the host and the remote node was the following version:
>>>>>>>
>>>>>>> gnutls-devel-3.3.8-12.el7.x86_64
>>>>>>> gnutls-dane-3.3.8-12.el7.x86_64
>>>>>>> gnutls-c++-3.3.8-12.el7.x86_64
>>>>>>> gnutls-3.3.8-12.el7.x86_64
>>>>>>> gnutls-utils-3.3.8-12.el7.x86_64
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Hideo Yamauchi.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>>>>>> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>>>>>>> Cc:
>>>>>>>> Date: 2015/4/28, Tue 14:06
>>>>>>>> Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>>>>>>>>
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> The result was the same even after I changed the remote node to RHEL7.1.
>>>>>>>>
>>>>>>>> This time I will try it with the Pacemaker host node on RHEL7.1.
>>>>>>>>
>>>>>>>> I noticed an interesting phenomenon.
>>>>>>>> The remote node fails to reconnect on the first crm_resource.
>>>>>>>> However, the remote node succeeds in reconnecting on the second crm_resource.
>>>>>>>>
>>>>>>>> I think there is some problem at the point where the connection
>>>>>>>> with the remote node is first cut.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Hideo Yamauchi.
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>>>>>>> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>>>>>>>> Cc:
>>>>>>>>> Date: 2015/4/28, Tue 11:52
>>>>>>>>> Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>>>>>>>>>
>>>>>>>>> Hi David,
>>>>>>>>> Thank you for comments.
>>>>>>>>>> At first glance this looks gnutls related. GNUTLS is returning -50 during
>>>>>>>>>> receive on the client side (pacemaker's side). -50 maps to 'invalid request'.
>>>>>>>>>>
>>>>>>>>>> > debug: crm_remote_recv_once: TLS receive failed: The request is invalid.
>>>>>>>>>>
>>>>>>>>>> We treat this error as fatal and destroy the connection. I've never encountered
>>>>>>>>>> this error and I don't know what causes it. It's possible there's a bug in
>>>>>>>>>> our gnutls usage... it's also possible there's a bug in the version of gnutls
>>>>>>>>>> that is in use as well.
>>>>>>>>>
>>>>>>>>> We built the remote node on RHEL6.5.
>>>>>>>>> Because it may be a gnutls problem, I will confirm it on RHEL7.1.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Hideo Yamauchi.
>>>>>>>>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org