[ClusterLabs] Is "Process pause detected" triggered too easily?
Jean-Marc Saffroy
saffroy at gmail.com
Wed Sep 27 11:39:53 EDT 2017
On Wed, 27 Sep 2017, Jan Friesse wrote:
> I don't think scheduling is the cause. If the scheduler were the cause, a
> different message (Corosync main process was not scheduled for ...) would
> kick in. This looks more like something is blocked in totemsrp.
Ah, interesting!
> > Also, it looks like the side effect is that corosync drops important
> > messages (I think "join" messages?), and I fear that this can lead to
>
> You mean membership join messages? Because there are a lot (327) of them
> in log you've sent.
Yes. In my test setup I didn't see any case where we lost membership join
messages, but here is why I am looking into this:
We had one problem on a real deployment of DLM+corosync (5 voters and 20
non-voters, with dlm on those 20, for a specific application that uses
libdlm). On a reboot of one server running just corosync (which thus did
NOT run dlm), a large number of other servers got briefly evicted from the
corosync ring; and when rejoining, dlm complained about a "stateful merge"
which forces a reboot. Note, dlm fencing is disabled.
In that system, it was "legal" for corosync to kick out these servers
(they had zero votes), but it was highly unexpected (they were running
fine) and the impact is high (reboot).
We did see "Process pause detected" in the logs on that system when the
incident happened, which is why I think it could be a clue.
> I'll definitely try to reproduce this bug and let you know. I don't
> think any message gets lost, but it's better to be on the safe side.
Thanks!
Cheers,
JM
--
saffroy at gmail.com