[Pacemaker] pingd

Thu Sep 2 15:41:33 UTC 2010

On Thursday, September 02, 2010, Lars Ellenberg wrote:
> On Thu, Sep 02, 2010 at 11:00:12AM +0200, Bernd Schubert wrote:
> > On Thursday, September 02, 2010, Andrew Beekhof wrote:
> > > On Wed, Sep 1, 2010 at 11:59 AM, Bernd Schubert
> > > 
> > > > My proposal is to rip out all network code out of pingd and to add
> > > > slightly modified files from 'iputils'.
> > > 
> > > Close, but that's not portable.
> > > Instead use ocf:pacemaker:ping which goes a step further and ditches
> > > the daemon piece altogether.
> > 
> > Hmm, we are already using that as a temporary solution. But I don't
> > think the ping RA is suitable for larger clusters. The ping script RA
> > runs everything serially and only at intervals when called by lrmd. Now
> > let's assume we have a 20 node cluster.
> > 
> > nodes = 20
> > timeout = 2
> > attempts = 2
> > 
> > That makes 20 x 2 x 2 = 80s for a single run in the worst case, even
> > with the default timeouts, which are already rather small. IMHO that is
> > a bit large, and with a shell script I don't see a way to improve it.
> > While we could send the pings in parallel, I have no idea how to lock
> > the variable of active nodes (active=`expr $active + 1`). I don't think
> > that plain sh or even bash has a semaphore or mutex lock. So IMHO, we
> > need a language that supports that: rewriting the pingd RA is one
> > choice, rewriting the ping RA in Python is another.
> 
> how about an fping RA ?
> active=$(fping -a -i 5 -t 250 -B1 -r1 $host_list 2>/dev/null | wc -l)

Oh cool, I didn't know about fping at all yet :) From the man page:

"In the default mode, if a target replies, it is noted and removed from the 
list of targets to check; if a target does not respond within a certain
time limit and/or retry limit it is designated as unreachable. fping also
supports sending a specified number of pings to a target, or looping
indefinitely (as in ping).

Unlike ping, fping is meant to be used in scripts, so its output is designed 
to be easy to parse."

That indeed is an option.
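
A minimal sketch of how such a count could be wrapped in a shell function,
assuming fping is installed. The fping options are the ones from the quoted
one-liner; count_active and FPING_CMD are made-up names for illustration and
are not part of any existing RA.

```shell
#!/bin/sh
# Sketch only: count reachable ping nodes with a single fping call.
# FPING_CMD and count_active are hypothetical names (assumptions).
FPING_CMD=${FPING_CMD:-fping}

count_active() {
    # With no hosts there is nothing to probe; report zero right away
    # instead of invoking the probe command without targets.
    if [ "$#" -eq 0 ]; then
        echo 0
        return 0
    fi
    # fping -a prints one line per host that answered, so the number of
    # output lines is the number of active ping nodes.
    # Options as in the quoted one-liner: -i 5 (ms between packets),
    # -t 250 (ms timeout), -B1 (backoff factor), -r1 (one retry).
    # The exit status of fping is ignored here; only the line count is used.
    "$FPING_CMD" -a -i 5 -t 250 -B1 -r1 "$@" 2>/dev/null | wc -l | tr -d ' '
}
```

Inside an RA this could be called as active=$(count_active $host_list), with
$host_list left unquoted on purpose so that it splits into one argument per
host. Since fping probes all hosts in one process, no shared counter and
therefore no lock is needed.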

> 
> terminates in about 3 seconds for a hostlist of 100 (on the LAN, 29 of
> which are alive).
> 
> > So in fact my first proposal was only the first step: first add
> > better network code, and then make it multi-threaded, so that each
> > ping host gets its own thread.
> 
> A working pingd daemon has the additional advantage that it can ask its
> peers for their ping node count, before actually updating the attribute,
> which should help with the "dampen race".
> 
> > Another reason why I don't like the shell RA too much is that shell takes
> > a considerable amount of CPU time. For a subset of systems where we need
> > ping as a replacement for quorum policy (*), CPU time is precious.
> > 
> > Thanks,
> > Bernd
> > 
> > PS: (*) As you insist ;) on quorum with n/2 + 1 nodes, we use ping as
> > replacement. We simply cannot fulfill n/2 + 1, as controller failure
> > takes down 50% of the systems (virtual machines) and the systems (VMs)
> > of the 2nd controller are then supposed to take over failed services. I
> > see that n/2 + 1 is optimal and also required for a few nodes. But if
> > you have a larger set of systems (e.g. minimum 6 with the VM systems I
> > have in mind) n/2 + 1 is sufficient, IMHO.
> 
> You meant to say you consider == n/2 sufficient, instead of > n/2 ?

Oh sorry, yes, that was what I meant to write.
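
To make the difference concrete, here is a small sketch of the two thresholds
for the 6 node example. majority_needed and half_needed are made-up helper
names, and the numbers only illustrate the arithmetic, not any Pacemaker
setting.

```shell
#!/bin/sh
# Sketch only: compare the two thresholds for an n-node cluster.
# majority_needed and half_needed are hypothetical helper names.
majority_needed() { echo $(( $1 / 2 + 1 )); }   # strictly more than n/2
half_needed()     { echo $(( $1 / 2 )); }       # exactly n/2 (even n)

n=6
survivors=$(( n / 2 ))   # a controller failure takes down half of the VMs
echo "n=$n survivors=$survivors majority=$(majority_needed "$n") half=$(half_needed "$n")"
# prints: n=6 survivors=3 majority=4 half=3
```

With 3 survivors out of 6, the n/2 + 1 rule (4 nodes) can never be met after
a controller failure, while == n/2 (3 nodes) can, which is why ping is used
as the additional tie-breaker.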

-- 
Bernd Schubert
DataDirect Networks