[ClusterLabs] Resource failure-timeout does not reset when resource fails to connect to both nodes
Digimer
lists at alteeve.ca
Mon Mar 28 17:32:08 UTC 2016
On 28/03/16 12:44 PM, Sam Gardner wrote:
> I have a simple resource defined:
>
> [root@ha-d1 ~]# pcs resource show dmz1
> Resource: dmz1 (class=ocf provider=internal type=ip-address)
> Attributes: address=172.16.10.192 monitor_link=true
> Meta Attrs: migration-threshold=3 failure-timeout=30s
> Operations: monitor interval=7s (dmz1-monitor-interval-7s)
>
> This is a custom resource which provides an ethernet alias to one of the
> interfaces on our system.
>
> I can unplug the cable on either node and failover occurs as expected,
> and 30s after re-plugging it I can repeat the exercise on the opposite
> node and failover will happen as expected.
>
> However, if I unplug the cable from both nodes, the failcount goes up,
> and the 30s failure-timeout does not reset the failcounts, meaning that
> pacemaker never tries to start the failed resource again.
>
> Full list of resources:
>
> Resource Group: network
> inif (ocf::internal:ip.sh): Started ha-d1.dev.com
> outif (ocf::internal:ip.sh): Started ha-d2.dev.com
> dmz1 (ocf::internal:ip.sh): Stopped
> Master/Slave Set: DRBDMaster [DRBDSlave]
> Masters: [ ha-d1.dev.com ]
> Slaves: [ ha-d2.dev.com ]
> Resource Group: filesystem
> DRBDFS (ocf::heartbeat:Filesystem): Stopped
> Resource Group: application
> service_failover (ocf::internal:service_failover): Stopped
>
> Failcounts for dmz1
> ha-d1.dev.com: 4
> ha-d2.dev.com: 4
>
> Is there any way to automatically recover from this scenario, other than
> setting an obnoxiously high migration-threshold?
>
> --
>
> Sam Gardner
>
> Software Engineer
>
> Trustwave | SMART SECURITY ON DEMAND
Stonith?
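
In case it helps anyone reading along, here is a rough sketch of what I would check first. It assumes the pcs 0.9-era command syntax seen in the quoted output (newer pcs releases have renamed some subcommands) and the resource name dmz1 from the original post. The run helper and the RESOURCE and DRY_RUN variables are illustrative names of my own, not part of pcs, and by default the script only prints the commands. As far as I know, Pacemaker of this era only notices an expired failure-timeout when something triggers a new transition, which on a quiet cluster may not happen more often than cluster-recheck-interval (15 minutes by default), so lowering that interval is one thing to try. Treat the 60s value as an example, not a recommendation.

```shell
#!/bin/sh
# Sketch only. RESOURCE, DRY_RUN and run() are illustrative names, not pcs features.
RESOURCE="${RESOURCE:-dmz1}"   # resource name taken from the original post
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print the commands, 0 = execute them

# Print a command, and execute it only when DRY_RUN is 0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Check whether fencing is enabled in the cluster properties.
run pcs property show stonith-enabled

# Show the current fail counts for the resource.
run pcs resource failcount show "$RESOURCE"

# Assumption: expired failures are only re-evaluated at the next transition,
# so a shorter recheck interval lets failure-timeout take effect sooner.
run pcs property set cluster-recheck-interval=60s

# Manual fallback: clear the failure history so the resource can be retried.
run pcs resource cleanup "$RESOURCE"
```

Running it with DRY_RUN=0 on a cluster node executes the commands instead of printing them. The cleanup step is a manual alternative to waiting for the timeout, so in practice you would pick one or the other.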
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?