[Pacemaker] Issues with fence and corosync crash

Tue Jan 4 17:19:42 UTC 2011

Hi,

On Fri, Dec 24, 2010 at 12:05:27PM +0100, Simone Felici wrote:
>
> Hi to all!
>
> I've an issue with my cluster env. First of all my config:
>
> A two-node CentOS 5.5 cluster (active + standby) with one DRBD partition, managing a Nagios service, an IP address, and storage.
> The config files are at the bottom.
>
> I'm trying to test the fence option to prevent split brain and concurrent access to the DRBD partition.
> Starting from a sane state, manually switching the resources or
> simulating a kernel panic, a process crash, or whatever else all works well. If
> I shut down eth1 (the 192.168.100.0 network, which is also the crossover cable for DRBD
> mirroring), the active node stays as it is and calls the fence handler, which adds this
> entry to the crm config:
> location drbd-fence-by-handler-ServerData ServerData \
>         rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf: #uname ne opsview-core01-tn
>
> But the standby node kills the corosync process:

How? Did the corosync process crash (looks like it)? Did you
find any core dumps?

> *** STANDBY NODE LOG ***
> Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14158 iface 192.168.100.12 to [1 of 10]
> Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14160 iface 192.168.100.12 to [2 of 10]
> Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14162 iface 192.168.100.12 to [3 of 10]
> Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14164 iface 192.168.100.12 to [4 of 10]
> Dec 24 11:00:06 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [3 of 10]
> Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14166 iface 192.168.100.12 to [4 of 10]
> Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14168 iface 192.168.100.12 to [5 of 10]
> Dec 24 11:00:07 corosync [TOTEM ] Incrementing problem counter for seqid 14170 iface 192.168.100.12 to [6 of 10]
> Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14172 iface 192.168.100.12 to [7 of 10]
> Dec 24 11:00:08 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [6 of 10]
> Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14174 iface 192.168.100.12 to [7 of 10]
> Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14176 iface 192.168.100.12 to [8 of 10]
> Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14178 iface 192.168.100.12 to [9 of 10]
> Dec 24 11:00:10 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [8 of 10]
> Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14180 iface 192.168.100.12 to [9 of 10]
> Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14182 iface 192.168.100.12 to [10 of 10]
> Dec 24 11:00:10 corosync [TOTEM ] Marking seqid 14182 ringid 0 interface 192.168.100.12 FAULTY - adminisrtative intervention required.
> Dec 24 11:00:11 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
> Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: No such file or directory (2)

At this point the corosync process is no more. Best to send the
backtrace to the openais list.
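
When attaching logs to such a report, a short summary of what the ring did
before the crash may help. The following is only a rough Python sketch, not
a corosync tool: the function name summarize_totem and the regular
expressions are assumptions modelled solely on the log lines quoted above,
and other corosync versions may word these messages differently.

```python
import re

# Patterns modelled on the quoted log lines only (an assumption, not an
# official corosync log format).
COUNTER = re.compile(
    r"(?:Incrementing|Decrementing) problem counter for .*?"
    r"iface (\S+) to \[(\d+) of (\d+)\]"
)
FAULTY = re.compile(r"interface (\S+) FAULTY")


def summarize_totem(lines):
    """Summarize TOTEM messages per interface.

    Returns (summary, failed_to_receive), where summary maps an interface
    address to its peak problem counter, the counter limit, and whether
    the interface was marked FAULTY.
    """
    summary = {}
    failed_to_receive = 0
    for line in lines:
        match = COUNTER.search(line)
        if match:
            iface, value, limit = match.groups()
            entry = summary.setdefault(
                iface, {"peak": 0, "limit": None, "faulty": False}
            )
            entry["peak"] = max(entry["peak"], int(value))
            entry["limit"] = int(limit)
            continue
        match = FAULTY.search(line)
        if match:
            entry = summary.setdefault(
                match.group(1), {"peak": 0, "limit": None, "faulty": False}
            )
            entry["faulty"] = True
            continue
        if "FAILED TO RECEIVE" in line:
            failed_to_receive += 1
    return summary, failed_to_receive
```

It could be fed the syslog file directly, for example
summarize_totem(open(path)) with path pointing at wherever corosync logs on
your system. The result shows at a glance whether an interface reached its
problem counter limit and how many FAILED TO RECEIVE messages came after.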

Thanks,

Dejan