[Pacemaker] designing a load balancer - request for comments
Klaus Darilion
klaus.mailinglists at pernau.at
Mon Feb 14 13:37:41 UTC 2011
On 11.02.2011 16:13, Raoul Bhatia [IPAX] wrote:
> On 02/11/2011 03:07 PM, Klaus Darilion wrote:
...
>> Or, how should pacemaker behave if Kamailio on the active node crashes.
>> Shall it just restart Kamailio or shall it migrate the IP address to the
>> other node and then try to restart Kamailio on the inactive node?
>
> pacemaker will not endlessly try to restart the configured resources:
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html
>
> pacemaker can be configured to restart a resource e.g. for a couple of
> times and if this does not work, it will migrate to another host.
> (you can also configure pacemaker to migrate upon the first failure)
Somehow Pacemaker does not react as I would expect. My config is:
primitive failover-ip ocf:heartbeat:IPaddr \
    params ip="83.136.32.161" \
    op monitor interval="3s"
primitive kamailio lsb:kamailio \
    meta migration-threshold="2" failure-timeout="60" \
    op monitor interval="15" timeout="15"
clone cloneKamailio kamailio
colocation colo_ip_with_kamailio inf: failover-ip cloneKamailio
property $id="cib-bootstrap-options" \
    dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="5"
At the beginning the status is:
Operations:
* Node server1:
kamailio:0: migration-threshold=2
+ (4) start: rc=0 (ok)
+ (5) monitor: interval=15000ms rc=0 (ok)
* Node server2:
kamailio:1: migration-threshold=2
+ (4) start: rc=0 (ok)
+ (5) monitor: interval=15000ms rc=0 (ok)
failover-ip: migration-threshold=1000000
+ (6) start: rc=0 (ok)
+ (7) monitor: interval=3000ms rc=0 (ok)
Then I stop Kamailio manually on server2. After a few seconds Pacemaker
detects that Kamailio is not running and restarts it:
Operations:
* Node server1:
kamailio:0: migration-threshold=2
+ (4) start: rc=0 (ok)
+ (5) monitor: interval=15000ms rc=0 (ok)
* Node server2:
kamailio:1: migration-threshold=2 fail-count=1 last-failure='Mon Feb 14 13:08:52 2011'
+ (8) stop: rc=0 (ok)
+ (9) start: rc=0 (ok)
+ (10) monitor: interval=15000ms rc=0 (ok)
failover-ip: migration-threshold=1000000
+ (6) start: rc=0 (ok)
+ (7) monitor: interval=3000ms rc=0 (ok)
Then I wait a few minutes, but the fail-count is still 1, although I
would expect the failure-timeout to clear it.
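To make that expectation concrete, here is a minimal Python sketch of how
I read the documentation. It is only a model of the behaviour I expected,
not Pacemaker's implementation; the function name expected_fail_count is
made up for illustration, and the sketch does not model when Pacemaker
actually evaluates the timeout.

```python
from datetime import datetime, timedelta


def expected_fail_count(fail_count, last_failure, now, failure_timeout_s):
    """Fail-count I would expect to see at `now` (hypothetical helper).

    Assumption: the count drops to 0 once `failure_timeout_s` seconds
    have passed since the last failure without a new one.
    """
    if now - last_failure >= timedelta(seconds=failure_timeout_s):
        return 0
    return fail_count


# Values taken from the status output: last-failure='Mon Feb 14 13:08:52 2011'
last_failure = datetime(2011, 2, 14, 13, 8, 52)

# 30 seconds after the failure the count should still be 1 ...
soon = expected_fail_count(1, last_failure, last_failure + timedelta(seconds=30), 60)
# ... but a few minutes later I would expect it to be cleared.
later = expected_fail_count(1, last_failure, last_failure + timedelta(minutes=5), 60)
```

With failure-timeout="60" this model clears the count from 13:09:52
onwards, which is not what the status output shows.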
Then I stop Kamailio again. Pacemaker detects that Kamailio is not
running, increments the fail-count, and migrates the IP address to the
other server. (Kamailio is not restarted on server2.)
Operations:
* Node server1:
kamailio:0: migration-threshold=2
+ (4) start: rc=0 (ok)
+ (5) monitor: interval=15000ms rc=0 (ok)
failover-ip: migration-threshold=1000000
+ (6) start: rc=0 (ok)
+ (7) monitor: interval=3000ms rc=0 (ok)
* Node server2:
kamailio:1: migration-threshold=2 fail-count=2 last-failure='Mon Feb 14 13:30:23 2011'
+ (9) start: rc=0 (ok)
+ (10) monitor: interval=15000ms rc=7 (not running)
+ (12) stop: rc=0 (ok)
failover-ip: migration-threshold=1000000
+ (6) start: rc=0 (ok)
+ (7) monitor: interval=3000ms rc=0 (ok)
+ (11) stop: rc=0 (ok)
Failed actions:
kamailio:1_monitor_15000 (node=server2, call=10, rc=7, status=complete): not running
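This part matches what I understand migration-threshold to mean. As a
rough Python sketch (again just my mental model, with a made-up function
name expected_action, not code taken from Pacemaker):

```python
def expected_action(fail_count, migration_threshold):
    """What I expect Pacemaker to do after a failed monitor (hypothetical).

    Assumption: below the threshold the resource is restarted in place;
    once fail-count reaches the threshold the node may no longer run it,
    so the resource is stopped there and colocated resources move away.
    """
    if fail_count >= migration_threshold:
        return "move-away"
    return "restart"


# First manual stop: fail-count=1, migration-threshold=2
first = expected_action(1, 2)
# Second manual stop: fail-count=2, migration-threshold=2
second = expected_action(2, 2)
```

The first failure gives "restart" and the second gives "move-away", which
is what the two status outputs above show for kamailio:1 and failover-ip.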
Then I wait a few minutes, but the fail-count is still 2 and Kamailio is
still not restarted. From the documentation I would expect the
fail-count to be reset after failure-timeout="60" and Kamailio to be
started again on server2.
After 4 minutes Kamailio is restarted again, but the fail-count is still 2:
Operations:
* Node server1:
kamailio:0: migration-threshold=2
+ (4) start: rc=0 (ok)
+ (5) monitor: interval=15000ms rc=0 (ok)
failover-ip: migration-threshold=1000000
+ (6) start: rc=0 (ok)
+ (7) monitor: interval=3000ms rc=0 (ok)
* Node server2:
kamailio:1: migration-threshold=2 fail-count=2 last-failure='Mon Feb 14 13:30:23 2011'
+ (12) stop: rc=0 (ok)
+ (13) start: rc=0 (ok)
+ (14) monitor: interval=15000ms rc=0 (ok)
failover-ip: migration-threshold=1000000
+ (6) start: rc=0 (ok)
+ (7) monitor: interval=3000ms rc=0 (ok)
+ (11) stop: rc=0 (ok)
So, what am I doing wrong? I would expect the fail-count to be reset
after 60s.
Thanks
Klaus