[Pacemaker] pingd process dies for no reason

Tue Jan 11 13:45:05 UTC 2011

On Tue, Jan 11, 2011 at 11:24:35AM +0100, Patrik.Rapposch at knapp.com wrote:
> We already changed the interval and timeout:
> <op id="pingd-op-monitor-30s" interval="30s" name="monitor" timeout="10s"/>
> 
> How large should dampen be set?
> 
> Please correct me if I am wrong, but I calculate it as follows.
> Assuming the last check was OK and the failure occurs one second later,
> there would be 29s until the next check starts, plus another 10 seconds
> of timeout, plus 5 seconds of dampen. That makes 44 seconds; isn't
> that enough?

I think "dampen" needs to be larger than the monitoring interval.
And the timeout on the operation should be large enough that
ping, even if the remote is unreachable for the first time,
will time out by itself (and is not killed prematurely by lrmd
because the operation timeout elapsed).

Try with:
  monitor operation interval: 15s
  instance parameter dampen: 20
  instance parameter timeout: something explicit, if you want to
  instance parameter attempts: something explicit, if you want to
  monitor operation timeout: 60s
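
To make the two rules of thumb concrete, here is a rough Python sketch that
checks a set of values against them and reproduces the 29 + 10 + 5
calculation from the quoted mail. It is only an illustration, not part of
Pacemaker or of any resource agent: the class and function names are made
up, and ping_worst_case_s is a hypothetical number you would have to measure
or derive yourself from your own timeout, attempts and host list.

```python
from dataclasses import dataclass


@dataclass
class PingSettings:
    """Hypothetical container for the three values discussed in this thread."""
    interval_s: int    # monitor operation interval
    dampen_s: int      # "dampen" instance parameter
    op_timeout_s: int  # monitor operation timeout (enforced by lrmd)


def dampen_covers_interval(s: PingSettings) -> bool:
    # Rule of thumb 1: dampen should be larger than the monitoring interval.
    return s.dampen_s > s.interval_s


def op_timeout_covers_ping(s: PingSettings, ping_worst_case_s: int) -> bool:
    # Rule of thumb 2: ping must be able to give up on its own before
    # the operation timeout elapses. ping_worst_case_s is an assumption
    # supplied by the caller, not something this sketch can know.
    return s.op_timeout_s > ping_worst_case_s


def worst_case_window_s(s: PingSettings) -> int:
    # Same arithmetic as in the quoted mail: the failure happens one second
    # after a good check, then the full operation timeout, then dampen.
    return (s.interval_s - 1) + s.op_timeout_s + s.dampen_s


# Values from the quoted mail and from the suggestion above.
current = PingSettings(interval_s=30, dampen_s=5, op_timeout_s=10)
suggested = PingSettings(interval_s=15, dampen_s=20, op_timeout_s=60)

if __name__ == "__main__":
    for label, s in (("current", current), ("suggested", suggested)):
        print(label,
              "dampen > interval:", dampen_covers_interval(s),
              "| upper bound on window:", worst_case_window_s(s), "s")
```

With the current values, dampen (5s) is smaller than the interval (30s), so
the first check fails; with the suggested values it passes. Note that the
window computed here is an upper bound, since it assumes the monitor
operation uses its entire timeout.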

BTW, someone should really implement the fping based ping RA ...
Or did I miss it?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.