[Pacemaker] pingd

Andrew Beekhof andrew at beekhof.net
Fri Sep 3 11:12:18 UTC 2010


On Fri, Sep 3, 2010 at 9:38 AM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
> On Thu, Sep 02, 2010 at 09:33:59PM +0200, Andrew Beekhof wrote:
>> On Thu, Sep 2, 2010 at 4:05 PM, Lars Ellenberg
>> <lars.ellenberg at linbit.com> wrote:
>> > On Thu, Sep 02, 2010 at 11:00:12AM +0200, Bernd Schubert wrote:
>> >> On Thursday, September 02, 2010, Andrew Beekhof wrote:
>> >> > On Wed, Sep 1, 2010 at 11:59 AM, Bernd Schubert
>> >> > > My proposal is to rip all the network code out of pingd and to add
>> >> > > slightly modified files from 'iputils'.
>> >> >
>> >> > Close, but that's not portable.
>> >> > Instead use ocf:pacemaker:ping which goes a step further and ditches
>> >> > the daemon piece altogether.
>> >>
>> >> Hmm, we are already using that temporarily. But I don't think the ping
>> >> RA is suitable for larger clusters. The ping RA runs everything
>> >> serially, and only at intervals when called by the lrmd. Now let's
>> >> assume we have a 20-node cluster.
>> >>
>> >> nodes = 20
>> >> timeout = 2
>> >> attempts = 2
>> >>
>> >> That makes 80s for a single run, even with these already rather small
>> >> default timeouts, which is IMHO too long. And with a shell script I
>> >> don't see a way to improve that. While we could send the pings in
>> >> parallel, I have no idea how to lock the variable counting active nodes
>> >> (active=`expr $active + 1`). Plain sh, and even bash, has no semaphore
>> >> or mutex. So IMHO we need a language that supports that: rewriting the
>> >> pingd RA is one choice, rewriting the ping RA in Python is another.
>> >
>> > how about an fping RA ?
>> > active=$(fping -a -i 5 -t 250 -B1 -r1 $host_list 2>/dev/null | wc -l)
>> >
>> > terminates in about 3 seconds for a hostlist of 100 (on the LAN, 29 of
>> > which are alive).
>>
>> Happy to add if someone writes it :-)
>
> I thought so ;-)
> An additional note for whoever does:
>
> With fping you can get fancy about "better connectivity";
> you are not limited to the measure "number of nodes responding".
> You could also use the statistics on packet loss and rtt that -c or -C
> mode prints on stderr (example output below; choose whichever you think
> is easier to parse), then apply some scoring scheme to average or max
> packet loss, rtt, or whatever else makes sense to you.
> (If a switch starts dying, it may produce increasing packet loss first...)

This sounds great.
I think we want the ping RA to use fping where available.
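As a concrete illustration, here is a rough sketch of how the monitor logic of such an fping-based RA could count reachable hosts and turn that into a pingd-style score. The scoring (host count times a multiplier of 1000) and the attrd_updater call mentioned in the comments mirror what ocf:pacemaker:ping does; the fping invocation is the one Lars suggested above. The "sample" variable stands in for a live fping run so the counting can be shown on its own:

```shell
#!/bin/sh
# Sketch only: in a real RA, the host list would come from
# OCF_RESKEY_host_list, and "sample" would be the live output of:
#   fping -a -i 5 -t 250 -B1 -r1 $host_list 2>/dev/null
sample='10.9.9.50
10.9.9.53
10.9.9.54'

# fping -a prints one line per reachable host, so wc -l gives the active
# count; no shared-variable locking is needed even though the probes run
# in parallel inside fping.
active=$(printf '%s\n' "$sample" | wc -l)

# Host count times multiplier, as in ocf:pacemaker:ping's default scoring.
# The result would then be pushed to attrd, e.g.:
#   attrd_updater -n pingd -v "$score" -d "$OCF_RESKEY_dampen"
score=$(( active * 1000 ))
echo "$active hosts alive, score $score"
```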

>
> Or start a smokeping daemon,
> and use the triggers there to change pacemaker attributes.
> Uhm, well, that's probably no longer maintainable, though ;-)
>
> # fping -q -i 5 -t 250 -B1 -r2 -C5 -g 10.9.9.50 10.9.9.70
> 10.9.9.50 : 0.14 0.14 0.16 0.12 0.15
> 10.9.9.51 : - - - - -
> 10.9.9.52 : - - - - -
> 10.9.9.53 : 0.37 0.34 0.36 0.34 0.34
> 10.9.9.54 : 0.13 0.12 0.13 0.12 0.13
> 10.9.9.55 : 0.17 0.15 0.16 0.12 0.22
> 10.9.9.56 : 0.32 0.32 0.31 0.41 0.36
> 10.9.9.57 : 0.35 0.33 0.32 0.34 0.32
> 10.9.9.58 : - - - - -
> 10.9.9.59 : - - - - -
> 10.9.9.60 : - - - - -
> 10.9.9.61 : - - - - -
> 10.9.9.62 : - - - - -
> 10.9.9.63 : - - - - -
> 10.9.9.64 : - - - - -
> 10.9.9.65 : 1.92 0.33 0.33 0.33 0.34
> 10.9.9.66 : - - - - -
> 10.9.9.67 : - - - - -
> 10.9.9.68 : - - - - -
> 10.9.9.69 : 0.15 0.14 0.17 0.13 0.14
> 10.9.9.70 : - - - - -
>
> # fping -q -i 5 -t 250 -B1 -r2 -c5 -g 10.9.9.50 10.9.9.70
> 10.9.9.50 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.11/0.13/0.15
> 10.9.9.51 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.52 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.53 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.33/0.34/0.37
> 10.9.9.54 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.10/0.11/0.13
> 10.9.9.55 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.13/0.16/0.20
> 10.9.9.56 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.34/0.36/0.41
> 10.9.9.57 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.16/0.25/0.33
> 10.9.9.58 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.59 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.60 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.61 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.62 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.63 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.64 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.65 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.28/0.32/0.34
> 10.9.9.66 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.67 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.68 : xmt/rcv/%loss = 5/0/100%
> 10.9.9.69 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.13/0.14/0.15
> 10.9.9.70 : xmt/rcv/%loss = 5/0/100%
>
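One way to act on that suggestion: the %loss field of the -c summary is easy to extract with sed. A rough sketch, using one line of the sample output above; the (100 - loss) * 10 scoring is an arbitrary placeholder to show the idea, not anything the ping RA does today:

```shell
#!/bin/sh
# One summary line from the "fping -c5" run above; live output arrives
# on stderr, so a real RA would capture it with 2>&1 or a redirect.
line='10.9.9.57 : xmt/rcv/%loss = 5/5/0%, min/avg/max = 0.16/0.25/0.33'

# Pull out the packet-loss percentage (third field of xmt/rcv/%loss).
loss=$(printf '%s\n' "$line" | sed -n 's|.*%loss = [0-9]*/[0-9]*/\([0-9]*\)%.*|\1|p')

# Placeholder scoring: full marks at 0% loss, zero at 100% loss.
score=$(( (100 - loss) * 10 ))
echo "loss ${loss}%, score $score"
```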
>> >> So in fact my first proposal was also only the first step - first add
>> >> better network code, then make it multi-threaded: each ping host gets
>> >> its own thread.
>> >
>> > A working pingd daemon has the additional advantage that it can ask its
>> > peers for their ping node count, before actually updating the attribute,
>> > which should help with the "dampen race".
>>
>> That happens at the attrd level in both cases.  pingd adds nothing here.
>
> I thought pingd did the dampening itself, even communicated with its peer
> pingd's, and there was no more dampening in attrd involved after that.

Nope. It's all in attrd.  I don't like writing things twice :-)

> But if you say so. I never looked at pingd too closely.
>
>> >> PS: (*) As you insist ;) on quorum with n/2 + 1 nodes, we use ping as a
>> >> replacement. We simply cannot fulfill n/2 + 1, as a controller failure
>> >> takes down 50% of the systems (virtual machines), and the systems (VMs)
>> >> of the 2nd controller are then supposed to take over the failed
>> >> services. I see that n/2 + 1 is optimal and also required for a few
>> >> nodes. But if you have a larger set of systems (e.g. a minimum of 6
>> >> with the VM systems I have in mind), n/2 + 1 is sufficient, IMHO.
>> >
>> > You meant to say you consider == n/2 sufficient, instead of > n/2 ?
>
> So you have a two-node virtualization setup, each host running n/2 VMs,
> and do the Pacemaker clustering between those VMs?
>
> I'm sure you could easily add, "somewhere else", a very bare-bones VM
> (or real) server that is a dedicated member of your cluster but
> never takes any resources? It just serves as an arbitrator, as your "+1"?
>
> That may be easier, safer, and more transparent than
> no-quorum=ignore plus some ping-attribute-based auto-shutdown.
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>



