[Pacemaker] pacemaker-remote tls handshaking

Fri May 24 18:06:35 UTC 2013

----- Original Message -----
> From: "David Vossel" <dvossel at redhat.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Thursday, May 23, 2013 11:21:33 PM
> Subject: Re: [Pacemaker] pacemaker-remote tls handshaking
> 
> ----- Original Message -----
> > From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> > To: "The Pacemaker cluster resource manager"
> > <pacemaker at oss.clusterlabs.org>
> > Sent: Thursday, May 23, 2013 4:35:02 PM
> > Subject: Re: [Pacemaker] pacemaker-remote tls handshaking
> > 
> > Working on this problem further...
> > 
> > On Tue, May 21, 2013 at 5:14 PM, David Vossel <dvossel at redhat.com> wrote:
> > > I'd suggest this.  Try running the pacemaker_remote regression test and
> > > see what happens.  This will start up an instance of pacemaker_remote
> > > locally and issue client commands to it to test both the TLS connection
> > > and the ability to start/stop/monitor services.
> > >
> > > /usr/share/pacemaker/tests/lrmd/regression.py  -R
> > 
> > But sadly SL 6.4 doesn't have the systemctl commands this is trying to
> 
> oops
> 
> > use.  (Also, I am building RPMs and installing those, and the lrmd
> > regression tests aren't included in pacemaker-cts.
> 
> another oops
> 
> > No problem, I ran
> > directly from the build directory.)  It doesn't seem to make much
> > progress.  The stdout is:
> > 
> >     sh: systemctl: command not found
> >     sh: /lib/systemd/system/lrmd_dummy_daemon.service: No such file or directory
> >     sh: systemctl: command not found
> >     Starting ...
> > 
> > And the lrmd-regression.log has:
> >     Set r/w permissions for uid=496, gid=494 on /tmp/lrmd-regression.log
> >     May 23 15:14:39 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: lrmd
> >     May 23 15:14:39 [3610] swbuildsl6 pacemaker_remoted:   notice: lrmd_init_remote_tls_server:     Starting a tls listener on port 3121.
> >     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_ro
> >     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_rw
> >     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_shm
> >     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: attrd
> >     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: stonith-ng
> >     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: crmd
> >     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: main:    Starting
> > 
> > 
> > > By default, the connection should retry for 60 seconds after the vm
> > > resource starts.  Like you've noticed, this
> > > can be extended to account for vms that take longer to boot.
> > 
> > But maybe this should start after the monitor method for the VM first
> > indicates success?  Or does it already?
> 
> The policy engine has no way of expressing this right now. It would be
> difficult to make this happen.  Likely your idea of additional start scripts
> to verify when the VM's network is actually available would be a better
> choice.
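
A minimal sketch of such a check, in case it helps. This is only an illustration, not anything shipped with Pacemaker: the function name wait_for_port and its parameters are made up here, and the 60 second default just mirrors the retry window mentioned above. It polls the pacemaker_remote port (3121) until a TCP connection succeeds or the deadline passes.

```python
import socket
import time


def wait_for_port(host, port=3121, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Hypothetical helper: retries every `interval` seconds and gives up
    (returning False) once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if the port is closed or
            # the host is not reachable yet (for example, still booting).
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A start script could call wait_for_port("swbuildsl6") and report failure if it returns False. Note that this only shows the port accepts TCP connections, the same thing a telnet to port 3121 shows; it says nothing about whether the TLS handshake will succeed.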
> 
> > 
> > >> There have been a few segfaults of crmd during my testing of this, so
> > >> perhaps
> > >> there is a memory smash somewhere. (A couple times the failure was at
> > >> remote_lrmd_ra.c:186,
> > >
> > > Please provide gdb backtrace.  We need to get this resolved asap before
> > > the release of v.1.1.10 is complete.
> > > I believe there is a new rc in the works already.
> > 
> > So I've attached results from a few core dumps.  All were triggered
> > using "crm resource cleanup swbuildsl6" where swbuildsl6 is the host
> > name of the VM  (that I can still telnet to port 3121).
> 
> thanks :)
> 
> > >> > I doubt this will make a difference, but here's the key I use
> > >> > during testing,
> > >> > lrmd:ce9db0bc3cec583d3b3bf38b0ac9ff91
> > 
> > It makes no difference.  I had wondered if the shorter key would matter.
> > 
> > Also, I've attached some patches I made to 1.1.10rc3 to try to resolve
> > this problem.  So far no success.  Some of these add logging; the
> > others fix what looks to me like fishy code with cases that aren't
> > completely handled.  With the additional logging, I see these results
> > being logged:
> > 
> >     May 23 17:06:51 swbuildsl6 pacemaker_remoted[2326]:   notice: lrmd_remote_listen: LRMD client connection established. 0x995250 id: df04d8ee-7fcb-4025-8c8f-8a1555a4d097
> >     May 23 17:06:53 cvmh02 crmd[18982]:  warning: lrmd_tcp_connect_cb: Client tls handshake failed for server swbuildsl6:3121. Disconnecting
> >     May 23 17:06:52 swbuildsl6 pacemaker_remoted[2326]:    error: lrmd_remote_client_msg: Remote lrmd tls handshake failed: -9
> >     May 23 17:06:52 swbuildsl6 pacemaker_remoted[2326]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: <unknown> id: df04d8ee-7fcb-4025-8c8f-8a1555a4d097
> > 
> > Puzzling -- nothing being logged from
> > crm_initiate_client_tls_handshake -- is there something I need to add
> > to somehow activate the crm_err and crm_info calls?
> 
> Well, you've definitely gotten my attention.  I tried this on my rhel 6 box
> and sure enough, I'm seeing the exact same thing you're seeing.  No worries.
> I'll track this down.  I'm sure it has to do with the gnutls version being
> used.

I figured it out. I believe it's a gnutls bug: the old gnutls library version doesn't like the way I'm setting the PSK credentials, which makes the handshake fail. I'm implementing a workaround now and will have a patch by Tuesday.

-- Vossel

> 
> In the meantime, if you want to test this feature, it does work in Fedora
> 18.  Thanks for all your work on testing this.  Your feedback came just in
> time. We are about to release 1.1.10 :)
> 
> -- Vossel
> 
> > /rlt
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 
> 