[Pacemaker] pacemaker and spanning tree in the network between the nodes
Dejan Muhamedagic
dejanmm at fastmail.fm
Mon Dec 21 11:44:17 UTC 2009
Hi,
On Fri, Dec 18, 2009 at 03:44:11PM +0100, Sebastian Reitenbach wrote:
> Hi,
>
> I have a 4-node cluster managing some XEN resources. The XEN resources have
> location constraints defined, based on pingd. On each node, a pingd clone is
> running. XEN resources are only started when pingd is able to ping the
> ping node. The XEN resources also have a preferred and a fallback location defined.
> The pingd resources have a timeout of 60 seconds defined.
> The cluster nodes run on SLES11, x86_64, with those rpms installed:
> heartbeat-3.0.0-33.2
> pacemaker-1.0.5-4.1
> libpacemaker3-1.0.5-4.1
> pacemaker-mgmt-client-1.99.2-7.1
> pacemaker-mgmt-1.99.2-7.1
> openais-0.80.3-26.1
> libopenais2-0.80.3-26.1
>
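A configuration along the lines you describe might look roughly like this in
crm shell syntax (resource names, node names, scores and the ping target are
only placeholders, not taken from your setup):

  primitive ping ocf:pacemaker:pingd \
          params host_list="192.168.1.1" multiplier="100" \
          op monitor interval="15s" timeout="60s"
  clone ping-clone ping meta globally-unique="false"
  primitive vm1 ocf:heartbeat:Xen \
          params xmfile="/etc/xen/vm1.cfg" \
          op monitor interval="30s"
  # never run vm1 on a node that cannot reach the ping node
  location vm1-needs-ping vm1 \
          rule -inf: not_defined pingd or pingd lte 0
  # preferred and fallback locations
  location vm1-prefers-node1 vm1 100: node1
  location vm1-fallback-node2 vm1 50: node2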
> I want to switch to a redundant network layout, using spanning tree between
> the switches. In case of a spanning tree recalculation, because of a path
> failure or for whatever other reason, I don't want nodes to be declared dead
> because they cannot send heartbeats to each other during that time.
>
> Therefore I tried to prepare pacemaker on the cluster nodes.
> I put the whole cluster in maintenance mode via the hb_gui.
>
> Then I reconfigured /etc/ha.d/ha.cf and defined deadtime 70 and initdead 100.
> Then I restarted heartbeat on each cluster node. I waited until all cluster
> members were marked green/online in the GUI again. Then I turned off the
> maintenance mode.
> All XEN resources were shut down immediately.
Oops.
> Then
A sentence missing?
> In the hb_gui, the pingd resources looked a bit "strange". After leaving
> maintenance mode, only one pingd resource showed the description
> ocf::pacemaker:pingd in hb_gui under Management. They were green and shown
> as running on ['<server>'].
>
> Then I tried to restart the XEN resources manually, but the cluster only tried
> to start them on one host, not on the preferred or fallback location.
>
> Then I shut down heartbeat on all 4 cluster nodes again, put back the
> old ha.cf file with deadtime 15 and initdead 40, and restarted heartbeat.
> After the cluster was running, the pingd resources were started up as well.
> Then, after the 60 seconds, the ping attribute was set, and the XEN
> resources were started on all hosts.
>
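For reference, the heartbeat timing changes you describe amount to something
like this in /etc/ha.d/ha.cf (only the two directives mentioned, values taken
from your mail):

  # long timeouts, meant to ride out a spanning tree recalculation
  deadtime 70
  initdead 100

and the original values you restored afterwards:

  deadtime 15
  initdead 40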
> I wonder about some things:
> 1. Why did three of the pingd resources have no description shown after leaving
> maintenance mode?
>
> 2. Why were all XEN resources shut down after leaving maintenance mode?
> Here I have a theory: in maintenance mode, the pingd attribute did not get
> updated, and because heartbeat was restarted on each node, the attribute was
> not set. Therefore, when leaving maintenance mode, pacemaker decided to
> shut down the XEN resources, because the pingd attribute was not set.
Sounds like a plausible explanation.
> 3. Why was the pingd attribute not set immediately after pingd started up and
> was able to ping the ping node? After pingd was started, it waited 60
> seconds (the timeout value) before setting the attribute, and only then were
> the XEN resources able to start, due to their location constraint. (See the
> note below these questions.)
>
> 4. Maybe the answers to the other questions will already answer this:
> why did the cluster behave so strangely at all with the large timeout values
> set in ha.cf?
>
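Regarding (3): if the delay comes from pingd's dampening interval rather than
from the operation timeout, it can be set explicitly via the dampen parameter
of ocf:pacemaker:pingd; the value below is only an example, added to the
earlier sketch:

  primitive ping ocf:pacemaker:pingd \
          params host_list="192.168.1.1" multiplier="100" dampen="5s" \
          op monitor interval="15s" timeout="60s"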
> I could also send a cluster report in case it helps to figure out what was
> wrong here; I just did not want to send a large attachment to the list in
> the first place.
Probably best to open a bugzilla and attach the report there.
I guess that special care is necessary when setting resources to
unmanaged mode in case there are constraints which depend on
pingd attributes.
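For example, before switching maintenance mode off again, it may be worth
checking on each node that the attribute the constraints depend on is actually
present in the status section (attribute name assumed to be the pingd default):

  cibadmin -Q -o status | grep 'name="pingd"'

If it is missing or 0 for a node, a -inf rule like the one sketched above
would keep the XEN resources off that node.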
Thanks,
Dejan
> regards,
> Sebastian
>