[Pacemaker] pingd

Lars Ellenberg lars.ellenberg at linbit.com
Fri Sep 3 03:38:55 EDT 2010


On Thu, Sep 02, 2010 at 09:33:59PM +0200, Andrew Beekhof wrote:
> On Thu, Sep 2, 2010 at 4:05 PM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> > On Thu, Sep 02, 2010 at 11:00:12AM +0200, Bernd Schubert wrote:
> >> On Thursday, September 02, 2010, Andrew Beekhof wrote:
> >> > On Wed, Sep 1, 2010 at 11:59 AM, Bernd Schubert
> >> > > My proposal is to rip all of the network code out of pingd and to
> >> > > add slightly modified files from 'iputils'.
> >> >
> >> > Close, but that's not portable.
> >> > Instead use ocf:pacemaker:ping which goes a step further and ditches
> >> > the daemon piece altogether.
> >>
> >> Hmm, we are already using that temporarily. But I don't think the ping
> >> RA is suitable for larger clusters. The ping RA is a shell script that
> >> pings everything serially, and only at the intervals at which lrmd calls
> >> it. Now let's assume we have a 20-node cluster.
> >>
> >> nodes = 20
> >> timeout = 2
> >> attempts = 2
> >>
> >> That makes 80s (20 nodes x 2 attempts x 2s) for a single run with already
> >> rather small default timeouts, which is IMHO a bit too long. And with a
> >> shell script I don't see a way to improve that. While we could send the
> >> pings in parallel, I have no idea how to lock the variable counting active
> >> nodes (active=`expr $active + 1`). I don't think that plain sh or even
> >> bash has a semaphore or mutex lock. So IMHO we need a language that
> >> supports that; rewriting the pingd RA is one choice, rewriting the ping
> >> RA in Python is another.
> >
> > how about an fping RA?
> > active=$(fping -a -i 5 -t 250 -B1 -r1 $host_list 2>/dev/null | wc -l)
> >
> > terminates in about 3 seconds for a hostlist of 100 (on the LAN, 29 of
> > which are alive).
> 
> Happy to add if someone writes it :-)

I thought so ;-)
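For whoever picks this up, here is a rough, untested sketch of what the
monitor logic of such an fping RA could boil down to (host_list, multiplier,
attr_name and dampen are made-up parameter names, and the attrd_updater call
is modelled on what ocf:pacemaker:ping does):

# untested sketch; host_list, multiplier, attr_name, dampen are assumptions
active=$(fping -a -i 5 -t 250 -B1 -r1 $host_list 2>/dev/null | wc -l)
# push the value into the cluster and let attrd handle the dampening
attrd_updater -n "$attr_name" -U $(( active * multiplier )) -d "$dampen"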
An additional note for whoever is going to write it:

With fping you can get fancy about "better connectivity":
you are not limited to the measure "number of nodes responding".
You could also use the statistics on packet loss and RTT that fping provides
on stderr in -c or -C mode (example output below; choose whichever you find
easier to parse), and then apply some scoring scheme on average or maximum
packet loss, RTT, or whatever else makes sense to you.
(If a switch starts dying, it may well produce increasing packet loss first...)

Or start a smokeping daemon,
and use its triggers to change Pacemaker attributes.
Uhm, well, that's probably no longer maintainable, though ;-)

# fping -q -i 5 -t 250 -B1 -r2 -C5 -g 10.9.9.50 10.9.9.70
10.9.9.50 : 0.14 0.14 0.16 0.12 0.15
10.9.9.51 : - - - - -
10.9.9.52 : - - - - -
10.9.9.53 : 0.37 0.34 0.36 0.34 0.34
10.9.9.54 : 0.13 0.12 0.13 0.12 0.13
10.9.9.55 : 0.17 0.15 0.16 0.12 0.22
10.9.9.56 : 0.32 0.32 0.31 0.41 0.36
10.9.9.57 : 0.35 0.33 0.32 0.34 0.32
10.9.9.58 : - - - - -
10.9.9.59 : - - - - -
10.9.9.60 : - - - - -
10.9.9.61 : - - - - -
10.9.9.62 : - - - - -
10.9.9.63 : - - - - -
10.9.9.64 : - - - - -
10.9.9.65 : 1.92 0.33 0.33 0.33 0.34
10.9.9.66 : - - - - -
10.9.9.67 : - - - - -
10.9.9.68 : - - - - -
10.9.9.69 : 0.15 0.14 0.17 0.13 0.14
10.9.9.70 : - - - - -

# fping -q -i 5 -t 250 -B1 -r2 -c5 -g 10.9.9.50 10.9.9.70
10.9.9.50 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.11/0.13/0.15
10.9.9.51 : xmt/rcv/%loss = 5/0/100%
10.9.9.52 : xmt/rcv/%loss = 5/0/100%
10.9.9.53 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.33/0.34/0.37
10.9.9.54 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.10/0.11/0.13
10.9.9.55 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.13/0.16/0.20
10.9.9.56 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.34/0.36/0.41
10.9.9.57 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.16/0.25/0.33
10.9.9.58 : xmt/rcv/%loss = 5/0/100%
10.9.9.59 : xmt/rcv/%loss = 5/0/100%
10.9.9.60 : xmt/rcv/%loss = 5/0/100%
10.9.9.61 : xmt/rcv/%loss = 5/0/100%
10.9.9.62 : xmt/rcv/%loss = 5/0/100%
10.9.9.63 : xmt/rcv/%loss = 5/0/100%
10.9.9.64 : xmt/rcv/%loss = 5/0/100%
10.9.9.65 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.28/0.32/0.34
10.9.9.66 : xmt/rcv/%loss = 5/0/100%
10.9.9.67 : xmt/rcv/%loss = 5/0/100%
10.9.9.68 : xmt/rcv/%loss = 5/0/100%
10.9.9.69 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.13/0.14/0.15
10.9.9.70 : xmt/rcv/%loss = 5/0/100%
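As a rough, untested sketch of one possible scoring scheme on the -c output
above (the 50% threshold, attr_name and dampen are made-up examples; note
that fping -q -c prints the per-host statistics to stderr):

# count the hosts with less than 50% packet loss;
# the per-host summary goes to stderr, hence the 2>&1 >/dev/null
score=$(fping -q -i 5 -t 250 -B1 -r2 -c5 -g 10.9.9.50 10.9.9.70 2>&1 >/dev/null |
        awk '{ split($5, a, "/"); if (a[3] + 0 < 50) n++ } END { print n + 0 }')
attrd_updater -n "$attr_name" -U "$score" -d "$dampen"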

> >> So in fact my first proposal was also only the first step - first add
> >> better network code, and then make it multi-threaded so that each ping
> >> host gets its own thread.
> >
> > A working pingd daemon has the additional advantage that it can ask its
> > peers for their ping node count, before actually updating the attribute,
> > which should help with the "dampen race".
> 
> That happens at the attrd level in both cases.  pingd adds nothing here.

I thought pingd did the dampening itself, even communicated with its peer
pingd instances, and that no further dampening in attrd was involved after
that. But if you say so. I never looked at pingd too closely.

> >> PS: (*) As you insist ;) on quorum with n/2 + 1 nodes, we use ping as a
> >> replacement. We simply cannot fulfill n/2 + 1, as a controller failure
> >> takes down 50% of the systems (virtual machines), and the systems (VMs)
> >> of the 2nd controller are then supposed to take over the failed services.
> >> I see that n/2 + 1 is optimal and also required for a small number of
> >> nodes. But if you have a larger set of systems (e.g. a minimum of 6 with
> >> the VM systems I have in mind), n/2 + 1 is sufficient, IMHO.
> >
> > You meant to say you consider == n/2 sufficient, instead of > n/2 ?

So you have two virtualization hosts, each hosting n/2 VMs,
and you do the Pacemaker clustering between those VMs?

I'm sure you could easily add, "somewhere else", a very bare-bones VM
(or physical) server that is a dedicated member of your cluster but
never takes any resources? It would just serve as an arbitrator, as your "+1".

That may be easier, safer, and more transparent than
no-quorum-policy=ignore plus some ping-attribute-based auto-shutdown.
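A minimal example of that (the node name "arbitrator" is made up): putting
the node into permanent standby keeps it a full, quorum-carrying cluster
member while preventing it from ever running resources:

# "arbitrator" is a hypothetical node name; a standby node still counts
# towards quorum but will never run resources
crm node standby arbitrator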

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



