[Pacemaker] Backup ring is marked faulty
Steven Dake
sdake at redhat.com
Wed Aug 3 00:45:46 UTC 2011
Which version of corosync?
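The running version can be checked on either node with, e.g.:

  corosync -v       # prints the corosync version string
  rpm -q corosync   # installed package version (assumes an RPM-based distro)

(Use the local package manager's query command if it's not an RPM-based system.)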
On 08/02/2011 07:35 AM, Sebastian Kaps wrote:
> Hi,
>
> we're running a two-node cluster with redundant rings.
> Ring 0 is a 10 Gbit direct connection; ring 1 consists of two 1 Gbit
> interfaces bonded in active-backup mode and routed through two
> independent switches for each node. The ring 1 network is our "normal"
> 1 Gbit LAN and should only be used if the direct 10 Gbit connection fails.
> I often (once a day on average, I'd guess) see that ring 1 (and only
> that one) is marked as FAULTY without any obvious reason.
>
> Aug 2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c76 c7a c7c c7e c80 c82 c84
> Aug 2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c82
> Aug 2 08:56:15 node02 corosync[5752]: [TOTEM ] Marking seqid 568416 ringid 1 interface x.y.z.1 FAULTY - administrative intervention required.
>
> Whenever I see this, I check whether the other node's address can be
> pinged (I have never seen any connectivity problems there), then
> re-enable the ring with "corosync-cfgtool -r", and everything looks OK
> for a while (i.e. hours or days).
>
> How can I find out why this happens?
> What do these "Retransmit List" and seqid (sequence ID, I assume?)
> values tell me?
> Is it safe to re-enable the second ring when the partner node can be
> pinged successfully?
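For reference, a minimal sketch of that manual check/recovery cycle, using only the stock corosync-cfgtool (the peer address below is a placeholder for the other node's ring-1 IP):

  # show the current state of both rings on this node
  corosync-cfgtool -s

  # verify the ring-1 path to the peer is actually reachable
  ping -c 3 <peer-ring1-address>

  # if the network looks fine, clear the FAULTY state of the marked ring
  corosync-cfgtool -r

  # confirm both rings show as active again
  corosync-cfgtool -s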
>
> The totem section of our config looks like this:
>
> totem {
>         rrp_mode: passive
>         join: 60
>         max_messages: 20
>         vsftype: none
>         consensus: 10000
>         secauth: on
>         token_retransmits_before_loss_const: 10
>         threads: 16
>         token: 10000
>         version: 2
>         interface {
>                 bindnetaddr: 192.168.1.0
>                 mcastaddr: 239.250.1.1
>                 mcastport: 5405
>                 ringnumber: 0
>         }
>         interface {
>                 bindnetaddr: x.y.z.0
>                 mcastaddr: 239.250.1.2
>                 mcastport: 5415
>                 ringnumber: 1
>         }
>         clear_node_high_bit: yes
> }
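If this turns out to be marginal packet loss on the LAN rather than a real outage, the RRP fault-detection sensitivity can be tuned in the same totem section. A hedged sketch (option names as documented in corosync.conf(5); the values are illustrative, not recommendations):

  totem {
          ...
          # problems tolerated per interface before it is marked FAULTY
          # (default 10)
          rrp_problem_count_threshold: 30
          # interval in ms after which the problem counter is decremented
          # (default 2000)
          rrp_problem_count_timeout: 2000
  }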
>