[ClusterLabs] STONITH when both IB interfaces are down, and how to trigger Filesystem mount/umount failure to test STONITH?
Andrei Borzenkov
arvidjaar at gmail.com
Thu Aug 20 07:00:32 UTC 2015
19.08.2015 13:31, Marcin Dulak wrote:
> However if instead both IPoIB interfaces go down on server-02,
> the mdt is moved to server-01, but no STONITH is performed on server-02.
> This is expected, because there is nothing in the configuration that
> triggers STONITH in case of IB connection loss.
> However, if IPoIB is flapping, this setup could lead to the mdt moving
> back and forth between server-01 and server-02.
> Should I have STONITH shut down a node that loses both IPoIB
> (remember they are passively redundant, only one active at a time)
> interfaces?
It is really up to the agent. Note that on-fail is triggered only if
the operation fails. So as long as the stop invocation does not return
an error, no fencing happens.
> If so, how to achieve that?
>
If you really want to trigger fencing when access to the block device
fails, you probably need to define it as a separate resource with its
own agent and set on-fail=fence on the monitor operation for this block
device. Otherwise you cannot really distinguish a filesystem-level
error from a block-device-level one.
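
As a rough sketch of what the monitor action of such an agent could
check: this is not an existing agent, the function name
blockdev_monitor and the direct-read probe are assumptions for
illustration, and a real agent would source ocf-shellfuncs rather than
define the return codes itself.

```shell
#!/bin/sh
# Sketch only: monitor check for a hypothetical block-device agent.
# Return codes are defined locally so the snippet runs on its own;
# a real OCF agent would get them from ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# blockdev_monitor DEVICE
# Returns OCF_SUCCESS if DEVICE is a block device and one block can be
# read from it, OCF_ERR_GENERIC otherwise.
blockdev_monitor() {
    dev="$1"

    # Not a block device (or the node is gone): report failure.
    [ -b "$dev" ] || return "$OCF_ERR_GENERIC"

    # Read a single 4 KiB block, bypassing the page cache (GNU dd),
    # so a cached copy does not hide a dead path to the storage.
    dd if="$dev" of=/dev/null bs=4096 count=1 iflag=direct 2>/dev/null \
        || return "$OCF_ERR_GENERIC"

    return "$OCF_SUCCESS"
}
```

With on-fail=fence on the monitor operation of a resource built around
a check like this, a failed read would lead to fencing independently of
what the Filesystem agent reports.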
> The context for the second question: the configuration contains the
> following Filesystem template:
>
> rsc_template lustre-target-template ocf:heartbeat:Filesystem \
> op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
> op start interval=0 timeout=300 on-fail=fence \
> op stop interval=0 timeout=300 on-fail=fence
>
> How can I make umount/mount of Filesystem fail in order to test STONITH
> action in these cases?
>
Insert "exit $OCF_ERR_GENERIC" in the stop method? :)
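
Something along these lines, placed in a private copy of the agent,
would let you switch the failure on and off without editing the script
each time. It is only a sketch: the flag file path and the helper name
are made up, and the return codes are defined locally here just so the
snippet runs on its own (the real agent gets them from ocf-shellfuncs).

```shell
#!/bin/sh
# Sketch only: fault-injection helper for testing on-fail=fence.
# OCF return codes defined locally; a real agent sources ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical flag file: create it on the node to force a failure.
FAIL_FLAG="${FAIL_FLAG:-/var/run/force-fs-stop-failure}"

# Returns OCF_ERR_GENERIC while the flag file exists, else OCF_SUCCESS.
inject_stop_failure() {
    if [ -e "$FAIL_FLAG" ]; then
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}
```

Calling the helper at the top of the stop method and exiting with its
return code when it is non-zero should make the stop operation fail, at
which point on-fail=fence on stop is expected to fence the node. Remove
the flag file afterwards.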
> Extra question: where can I find documentation/source describing what
> on-fail=fence does?
Pacemaker Explained has some description. It should initiate fencing
of the node where the resource had been active.
> Or what does on-fail=stop mean in the ethmonitor template below (what is
> stopped?)?
>
on-fail=stop sets the resource's target role to stopped, so Pacemaker
tries to stop it and leaves it stopped.
> rsc_template netmonitor-30sec ethmonitor \
> params repeat_count=3 repeat_interval=10 \
> op monitor interval=15s timeout=60s \
> op start interval=0s timeout=60s on-fail=stop \
>
> Marcin