[ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
renayama19661014 at ybb.ne.jp
Wed Aug 5 06:00:50 UTC 2015
Hi Andrew,
>> Do you know if this behaviour still exists?
>> A LOT of work went into the remote node logic in the last couple of months;
>> it's possible this was fixed as a side-effect.
>
>
> I have not yet confirmed it with the latest version.
> I will confirm it.
I confirmed it with the latest Pacemaker (pacemaker-eefdc909a41b571dc2e155f7b14b5ef0368f2de7).
The phenomenon still occurs.
On the first clean-up, Pacemaker fails to connect to pacemaker_remote.
The second clean-up succeeds.
The problem does not seem to have been fixed.
I added my debug log to the latest code again.
-------
(snip)
(snip)
static size_t
crm_remote_recv_once(crm_remote_t * remote)
{
    int rc = 0;
    size_t read_len = sizeof(struct crm_remote_header_v0);
    struct crm_remote_header_v0 *header = crm_remote_header(remote);

    if (header) {
        /* Stop at the end of the current message */
        read_len = header->size_total;
    }

    /* automatically grow the buffer when needed */
    if (remote->buffer_size < read_len) {
        remote->buffer_size = 2 * read_len;
        crm_trace("Expanding buffer to %u bytes", remote->buffer_size);

        remote->buffer = realloc_safe(remote->buffer, remote->buffer_size + 1);
        CRM_ASSERT(remote->buffer != NULL);
    }

#ifdef HAVE_GNUTLS_GNUTLS_H
    if (remote->tls_session) {
        if (remote->buffer == NULL) {
            crm_info("### YAMAUCHI buffer is NULL [buffer_zie[%d] readlen[%d]",
                     remote->buffer_size, read_len);
        }
        rc = gnutls_record_recv(*(remote->tls_session),
                                remote->buffer + remote->buffer_offset,
                                remote->buffer_size - remote->buffer_offset);
(snip)
-------
When Pacemaker's first connection to the remote node fails, my log message is printed.
It is not printed during the second (successful) connection.
[root at sl7-01 ~]# tail -f /var/log/messages | grep YAMA
Aug 5 14:46:25 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
Aug 5 14:46:26 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
Aug 5 14:46:28 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
Aug 5 14:46:30 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
Aug 5 14:46:31 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
(snip)
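To illustrate the state shown in the log (buffer pointer NULL while buffer_size is already 1326, so the grow branch is skipped for read_len 40), here is a minimal sketch of a guard that also allocates when the pointer is NULL. This is only an idea for discussion, not the actual Pacemaker fix; the helper name ensure_recv_buffer is made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: grow (or first allocate) a receive buffer.  Unlike the snippet
 * above, this branch also fires when the buffer pointer is NULL even though
 * buffer_size is already large enough -- exactly the state in the log
 * (buffer == NULL, buffer_size == 1326, read_len == 40). */
static int
ensure_recv_buffer(char **buffer, size_t *buffer_size, size_t read_len)
{
    if (*buffer == NULL || *buffer_size < read_len) {
        size_t new_size = (*buffer_size < read_len) ? 2 * read_len : *buffer_size;
        /* realloc(NULL, n) behaves like malloc(n) */
        char *tmp = realloc(*buffer, new_size + 1);

        if (tmp == NULL) {
            return -1;          /* allocation failed; caller must bail out */
        }
        *buffer = tmp;
        *buffer_size = new_size;
    }
    return 0;
}
```

With such a guard, the failing first connection would allocate the buffer instead of passing a NULL pointer on to gnutls_record_recv.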
Best Regards,
Hideo Yamauchi.
----- Original Message -----
> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2015/8/4, Tue 18:40
> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
>
> Hi Andrew,
>
>> Do you know if this behaviour still exists?
>> A LOT of work went into the remote node logic in the last couple of months;
>> it's possible this was fixed as a side-effect.
>
>
> I have not yet confirmed it with the latest version.
> I will confirm it.
>
> Many Thanks!
> Hideo Yamauchi.
>
>
> ----- Original Message -----
>> From: Andrew Beekhof <andrew at beekhof.net>
>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>> Cc:
>> Date: 2015/8/4, Tue 13:16
>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
>>
>>
>>> On 12 May 2015, at 12:12 pm, renayama19661014 at ybb.ne.jp wrote:
>>>
>>> Hi All,
>>>
>>> The problem seems to be that the buffer becomes NULL when crm_resource -C
>>> is run after the remote node has been rebooted.
>>>
>>> I incorporated log in a source code and confirmed it.
>>>
>>> ------------------------------------------------
>>> crm_remote_recv_once(crm_remote_t * remote)
>>> {
>>> (snip)
>>>     /* automatically grow the buffer when needed */
>>>     if (remote->buffer_size < read_len) {
>>>         remote->buffer_size = 2 * read_len;
>>>         crm_trace("Expanding buffer to %u bytes", remote->buffer_size);
>>>
>>>         remote->buffer = realloc_safe(remote->buffer, remote->buffer_size + 1);
>>>         CRM_ASSERT(remote->buffer != NULL);
>>>     }
>>>
>>> #ifdef HAVE_GNUTLS_GNUTLS_H
>>>     if (remote->tls_session) {
>>>         if (remote->buffer == NULL) {
>>>             crm_info("### YAMAUCHI buffer is NULL [buffer_zie[%d] readlen[%d]",
>>>                      remote->buffer_size, read_len);
>>>         }
>>>         rc = gnutls_record_recv(*(remote->tls_session),
>>>                                 remote->buffer + remote->buffer_offset,
>>>                                 remote->buffer_size - remote->buffer_offset);
>>> (snip)
>>> ------------------------------------------------
>>>
>>> May 12 10:54:01 sl7-01 crmd[30447]: info: crm_remote_recv_once: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
>>> May 12 10:54:02 sl7-01 crmd[30447]: info: crm_remote_recv_once: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
>>> May 12 10:54:04 sl7-01 crmd[30447]: info: crm_remote_recv_once: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
>>
>> Do you know if this behaviour still exists?
>> A LOT of work went into the remote node logic in the last couple of months;
>> it's possible this was fixed as a side-effect.
>>
>>>
>>> ------------------------------------------------
>>>
>>> gnutls_record_recv receives the NULL buffer and returns an error.
>>>
>>> ------------------------------------------------
>>> (snip)
>>> ssize_t
>>> _gnutls_recv_int(gnutls_session_t session, content_type_t type,
>>>                  gnutls_handshake_description_t htype,
>>>                  gnutls_packet_t *packet,
>>>                  uint8_t * data, size_t data_size, void *seq,
>>>                  unsigned int ms)
>>> {
>>>     int ret;
>>>
>>>     if (packet == NULL && (type != GNUTLS_ALERT && type != GNUTLS_HEARTBEAT)
>>>         && (data_size == 0 || data == NULL))
>>>         return gnutls_assert_val(GNUTLS_E_INVALID_REQUEST);
>>>
>>> (snip)
>>> ssize_t
>>> gnutls_record_recv(gnutls_session_t session, void *data, size_t data_size)
>>> {
>>>     return _gnutls_recv_int(session, GNUTLS_APPLICATION_DATA, -1, NULL,
>>>                             data, data_size, NULL,
>>>                             session->internals.record_timeout_ms);
>>> }
>>> (snip)
>>> ------------------------------------------------
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>> To: "users at clusterlabs.org" <users at clusterlabs.org>
>>>> Cc:
>>>> Date: 2015/5/11, Mon 16:45
>>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
>>>>
>>>> Hi Ulrich,
>>>>
>>>> Thank you for comments.
>>>>
>>>>> So your host and your resource are both named "snmp1"? I also don't
>>>>> have much experience with cleaning up resources for a node that is offline.
>>>>> What change should it make (while the node is offline)?
>>>>
>>>>
>>>> The remote resource and the remote node have the same name, "snmp1".
>>>>
>>>>
>>>> (snip)
>>>> primitive snmp1 ocf:pacemaker:remote \
>>>>     params \
>>>>         server="snmp1" \
>>>>     op start interval="0s" timeout="60s" on-fail="ignore" \
>>>>     op monitor interval="3s" timeout="15s" \
>>>>     op stop interval="0s" timeout="60s" on-fail="ignore"
>>>>
>>>> primitive Host-rsc1 ocf:heartbeat:Dummy \
>>>>     op start interval="0s" timeout="60s" on-fail="restart" \
>>>>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>     op stop interval="0s" timeout="60s" on-fail="ignore"
>>>>
>>>> primitive Remote-rsc1 ocf:heartbeat:Dummy \
>>>>     op start interval="0s" timeout="60s" on-fail="restart" \
>>>>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>     op stop interval="0s" timeout="60s" on-fail="ignore"
>>>>
>>>> location loc1 Remote-rsc1 \
>>>>     rule 200: #uname eq snmp1
>>>> location loc3 Host-rsc1 \
>>>>     rule 200: #uname eq bl460g8n1
>>>> (snip)
>>>>
>>>> pacemaker_remoted on the snmp1 node was stopped with SIGTERM.
>>>> Afterwards I restarted pacemaker_remoted on the snmp1 node.
>>>> Then I executed the crm_resource command, but the snmp1 node remained offline.
>>>>
>>>> I think the correct behaviour after executing the crm_resource command
>>>> would be for the snmp1 node to come back online.
>>>>
>>>>
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>>> To: users at clusterlabs.org; renayama19661014 at ybb.ne.jp
>>>>> Cc:
>>>>> Date: 2015/5/11, Mon 15:39
>>>>> Subject: Antw: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>>>>>
>>>>>>>> <renayama19661014 at ybb.ne.jp> wrote on 11.05.2015 at 06:22
>>>>> in message
>>>>> <361916.15877.qm at web200006.mail.kks.yahoo.co.jp>:
>>>>>> Hi All,
>>>>>>
>>>>>> I matched the OS version of the remote node to the host once again and
>>>>>> confirmed it with Pacemaker 1.1.13-rc2.
>>>>>>
>>>>>> It was the same even with the host on RHEL 7.1 (bl460g8n1).
>>>>>> The remote host was also RHEL 7.1 (snmp1).
>>>>>>
>>>>>> The first crm_resource -C fails.
>>>>>> --------------------------------
>>>>>> [root at bl460g8n1 ~]# crm_resource -C -r snmp1
>>>>>> Cleaning up snmp1 on bl460g8n1
>>>>>> Waiting for 1 replies from the CRMd. OK
>>>>>>
>>>>>> [root at bl460g8n1 ~]# crm_mon -1 -Af
>>>>>> Last updated: Mon May 11 12:44:31 2015
>>>>>> Last change: Mon May 11 12:43:30 2015
>>>>>> Stack: corosync
>>>>>> Current DC: bl460g8n1 - partition WITHOUT quorum
>>>>>> Version: 1.1.12-7a2e3ae
>>>>>> 2 Nodes configured
>>>>>> 3 Resources configured
>>>>>>
>>>>>>
>>>>>> Online: [ bl460g8n1 ]
>>>>>> RemoteOFFLINE: [ snmp1 ]
>>>>>
>>>>> So your host and your resource are both named "snmp1"? I also don't
>>>>> have much experience with cleaning up resources for a node that is offline.
>>>>> What change should it make (while the node is offline)?
>>>>>
>>>>>>
>>>>>> Host-rsc1 (ocf::heartbeat:Dummy): Started bl460g8n1
>>>>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started bl460g8n1 (failure ignored)
>>>>>>
>>>>>> Node Attributes:
>>>>>> * Node bl460g8n1:
>>>>>> + ringnumber_0 : 192.168.101.21 is UP
>>>>>> + ringnumber_1 : 192.168.102.21 is UP
>>>>>>
>>>>>> Migration summary:
>>>>>> * Node bl460g8n1:
>>>>>> snmp1: migration-threshold=1 fail-count=1000000
>>>> last-failure='Mon
>>>>> May 11
>>>>>> 12:44:28 2015'
>>>>>>
>>>>>> Failed actions:
>>>>>> snmp1_start_0 on bl460g8n1 'unknown error' (1): call=5, status=Timed Out,
>>>>>> exit-reason='none', last-rc-change='Mon May 11 12:43:31 2015', queued=0ms,
>>>>>> exec=0ms
>>>>>> --------------------------------
>>>>>>
>>>>>>
>>>>>> The second crm_resource -C succeeded and was connected to the remote host.
>>>>>
>>>>> Then the node was online, it seems.
>>>>>
>>>>> Regards,
>>>>> Ulrich
>>>>>
>>>>>> --------------------------------
>>>>>> [root at bl460g8n1 ~]# crm_mon -1 -Af
>>>>>> Last updated: Mon May 11 12:44:54 2015
>>>>>> Last change: Mon May 11 12:44:48 2015
>>>>>> Stack: corosync
>>>>>> Current DC: bl460g8n1 - partition WITHOUT quorum
>>>>>> Version: 1.1.12-7a2e3ae
>>>>>> 2 Nodes configured
>>>>>> 3 Resources configured
>>>>>>
>>>>>>
>>>>>> Online: [ bl460g8n1 ]
>>>>>> RemoteOnline: [ snmp1 ]
>>>>>>
>>>>>> Host-rsc1 (ocf::heartbeat:Dummy): Started bl460g8n1
>>>>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>>>>>> snmp1 (ocf::pacemaker:remote): Started bl460g8n1
>>>>>>
>>>>>> Node Attributes:
>>>>>> * Node bl460g8n1:
>>>>>> + ringnumber_0 : 192.168.101.21 is UP
>>>>>> + ringnumber_1 : 192.168.102.21 is UP
>>>>>> * Node snmp1:
>>>>>>
>>>>>> Migration summary:
>>>>>> * Node bl460g8n1:
>>>>>> * Node snmp1:
>>>>>> --------------------------------
>>>>>>
>>>>>> The host and the remote node had the following gnutls versions:
>>>>>>
>>>>>> gnutls-devel-3.3.8-12.el7.x86_64
>>>>>> gnutls-dane-3.3.8-12.el7.x86_64
>>>>>> gnutls-c++-3.3.8-12.el7.x86_64
>>>>>> gnutls-3.3.8-12.el7.x86_64
>>>>>> gnutls-utils-3.3.8-12.el7.x86_64
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Hideo Yamauchi.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>>>>> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>>>>>> Cc:
>>>>>>> Date: 2015/4/28, Tue 14:06
>>>>>>> Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> The result was the same even after changing the remote node to RHEL 7.1.
>>>>>>>
>>>>>>> This time I will try with the Pacemaker host node on RHEL 7.1.
>>>>>>>
>>>>>>>
>>>>>>> I noticed an interesting phenomenon.
>>>>>>> The remote node fails to reconnect on the first crm_resource.
>>>>>>> However, it succeeds in reconnecting on the second crm_resource.
>>>>>>>
>>>>>>> I think there is some problem at the point where the connection with
>>>>>>> the remote node was first cut.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Hideo Yamauchi.
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>>>>>> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>>>>>>> Cc:
>>>>>>>> Date: 2015/4/28, Tue 11:52
>>>>>>>> Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>>>>>>>>
>>>>>>>> Hi David,
>>>>>>>> Thank you for comments.
>>>>>>>>> At first glance this looks gnutls related. GNUTLS is returning -50
>>>>>>>>> during receive on the client side (pacemaker's side). -50 maps to
>>>>>>>>> 'invalid request'.
>>>>>>>>> >debug: crm_remote_recv_once: TLS receive failed: The request is invalid.
>>>>>>>>> We treat this error as fatal and destroy the connection. I've never
>>>>>>>>> encountered this error and I don't know what causes it. It's possible
>>>>>>>>> there's a bug in our gnutls usage... it's also possible there's a bug
>>>>>>>>> in the version of gnutls that is in use as well.
>>>>>>>> We built the remote node on RHEL 6.5.
>>>>>>>> Since it may be a gnutls problem, I will confirm it on RHEL 7.1.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Hideo Yamauchi.
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list: Users at clusterlabs.org
>>>>>>>> http://clusterlabs.org/mailman/listinfo/users
>>>>>>>>
>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>