[Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

Thu Aug 4 12:46:00 UTC 2011

 Hello,

 here's another problem we're having:

 Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected for 11149 ms, flushing membership messages.
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1) r(1) ip(x.y.z.3)
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.2) r(1) ip(x.y.z.1)
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
 Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 9708: memb=1, new=0, lost=1
 Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update: memb: node01 16885952
 Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update: lost: node02 33663168
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1) r(1) ip(x.y.z.3)
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
 Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
 Jul 31 03:51:11 node01 crmd: [5912]: notice: ais_dispatch_message: Membership 9708: quorum lost

 node01 gets STONITH'd shortly after that. Nothing in the logs 
 indicates beforehand that this is about to happen: for at least half 
 an hour before the pause there is only the normal status-message 
 noise from monitor ops etc.

 At the same time, node02 logs the following:

 Jul 31 03:51:01 node02 corosync[5810]:  [TOTEM ] A processor failed, forming new configuration.
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2) r(1) ip(x.y.z.1)
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.1) r(1) ip(x.y.z.3)
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
 Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 9708: memb=1, new=0, lost=1
 Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update: memb: node02 33663168
 Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update: lost: node01 16885952
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2) r(1) ip(x.y.z.1)
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
 Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
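
 To see whether such pauses have happened before without being noticed, 
 something like the following could be used to scan the syslog. This is 
 only a minimal sketch: the regular expression is derived from the 
 single TOTEM message quoted above, and the function name find_pauses 
 is hypothetical, not part of corosync or Pacemaker.

```python
import re

# Pattern derived from the one "Process pause detected" line quoted above.
# Assumption: each syslog entry is a single unwrapped line.
PAUSE_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+corosync\[\d+\]:\s+"
    r"\[TOTEM\s*\]\s+Process pause detected for (?P<ms>\d+) ms"
)


def find_pauses(lines):
    """Return a list of (timestamp, host, pause_ms) for each pause message."""
    hits = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            hits.append(
                (match.group("ts"), match.group("host"), int(match.group("ms")))
            )
    return hits
```

 It takes any iterable of log lines, for example an open file object 
 for the syslog file (the file location depends on the system), and 
 returns the pause durations in milliseconds so they can be compared 
 against the times at which the membership changed.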

 What does "Process pause detected" mean?

 Quoting from my other recent post regarding the backup ring being 
 marked faulty sporadically:

 |We're running a two-node cluster with redundant rings.
 |Ring 0 is a 10 Gbit direct connection; ring 1 consists of two 1 Gbit
 |interfaces that are bonded in active-backup mode and routed through
 |two independent switches for each node. The ring 1 network is our
 |"normal" 1G LAN and should only be used if the direct 10G connection
 |fails.
 |
 |Corosync Cluster Engine, version '1.3.1'
 |Copyright (c) 2006-2009 Red Hat, Inc.
 |
 |It's the version that comes with SLES11-SP1-HA.

 Thanks in advance!

-- 
 Sebastian