[Pacemaker] OCFS2 problems when connectivity lost
Ivan Savčić | Epix
ivan.savcic at epix.rs
Wed Dec 21 16:35:10 UTC 2011
On 21.12.2011 13:07, Tim Serong wrote:
> My guess would be:
>
> The filesystem can't stop on the non-quorate node, because the network
> connection is down, so DLM can't do its thing.
Ok.
> The filesystem is probably frozen on the quorate node, because of loss
> of DLM comms.
Ok, same problem as above then.
> If STONITH is configured, the non-quorate node should be killed after a
> failed (or timed out) stop, and the quorate node should resume behaving
> normally.
>
> HTH,
>
> Tim
But the lost DLM communication leads to *both* nodes hanging: the one
being shut down by Pacemaker (because it lost quorum) and the one in
the partition with quorum (which should therefore stay alive).
My point is that at least one OCFS2 node (the one in the partition with
quorum) should somehow survive the lost communication and stay healthy,
but DLM (or something else) gets "stuck" and both nodes hang. That's the
problem.
Ivan