[Pacemaker] pacemaker and spanning tree in the network between the nodes
Dejan Muhamedagic
dejanmm at fastmail.fm
Mon Dec 21 06:44:17 EST 2009
Hi,
On Fri, Dec 18, 2009 at 03:44:11PM +0100, Sebastian Reitenbach wrote:
> Hi,
>
> I have a 4 node cluster, managing some XEN resources. The XEN resources have
> location constraints defined, based on pingd. On each node, a pingd clone is
> running. XEN resources are only started when pingd is able to ping the
> ping node. The XEN resources also have a preferred and fallback location defined.
> The pingd resources have a timeout of 60 seconds defined.
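For reference, a setup like the one described above typically looks roughly
as follows in crm shell syntax; the resource and node names below are
invented placeholders, not taken from the actual configuration:

  primitive ping-check ocf:pacemaker:pingd \
      params host_list="192.168.1.254" multiplier="100" name="pingd" \
      op monitor interval="10s" timeout="60s"
  clone ping-clone ping-check meta globally-unique="false"
  primitive xen-vm1 ocf:heartbeat:Xen \
      params xmfile="/etc/xen/vm1" \
      op monitor interval="30s"
  # run the VM only where the ping node is reachable
  location xen-vm1-connectivity xen-vm1 \
      rule -inf: not_defined pingd or pingd lte 0
  # preferred and fallback locations
  location xen-vm1-preferred xen-vm1 100: node1
  location xen-vm1-fallback xen-vm1 50: node2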
> The cluster nodes run on SLES11, x86_64, with those rpms installed:
> heartbeat-3.0.0-33.2
> pacemaker-1.0.5-4.1
> libpacemaker3-1.0.5-4.1
> pacemaker-mgmt-client-1.99.2-7.1
> pacemaker-mgmt-1.99.2-7.1
> openais-0.80.3-26.1
> libopenais2-0.80.3-26.1
>
> I want to switch to a redundant network layout, using spanning tree between
> the switches. In case of a spanning tree recalculation, caused by a path
> failure or any other reason, I don't want nodes declared dead just because
> they cannot send heartbeats to each other during that time.
>
> Therefore I tried to prepare pacemaker on the cluster nodes.
> I put the whole cluster in maintenance mode via the hb_gui.
>
> Then I reconfigured /etc/ha.d/ha.cf and defined deadtime 70 and initdead 100.
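In plain config terms, that step (together with the maintenance switch
mentioned above; the crm shell command is the command-line equivalent of
what hb_gui does) amounts to something like:

  # /etc/ha.d/ha.cf on each node
  deadtime 70
  initdead 100

  # command-line equivalent of the hb_gui maintenance switch
  crm configure property maintenance-mode="true"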
> Then I restarted heartbeat on each cluster node. I waited until all cluster
> members were marked green/online in the GUI again. Then I turned off the
> maintenance mode.
> All XEN resources were shut down immediately.
Oops.
> Then
A sentence missing?
> In the hb_gui, the pingd resources looked a bit "strange". After leaving
> maintenance mode, only one pingd resource showed the description
> ocf:pacemaker:pingd in hb_gui under Management. They were green, and showed
> it running on ['<server>'].
>
> Then I tried to restart the XEN resources manually, but the cluster only tried
> to start them on one host, not on the preferred or fallback location.
>
> Then I shut down heartbeat on all 4 cluster nodes again, put back the
> old ha.cf file with deadtime 15 and initdead 40, and restarted heartbeat.
> After the cluster was running, the pingd resources were also started up.
> Then, after the 60 seconds, the pingd attribute was set, and the XEN
> resources were started up on all hosts.
>
> I wonder about some things:
> 1. why three of the pingd resources had no description shown after leaving the
> maintenance mode.
>
> 2. why all XEN resources were shut down after leaving the maintenance mode.
> Here I have a theory: in maintenance mode, the pingd attribute did not get
> updated, and because heartbeat was restarted on each node, the attribute was
> not set. Therefore, when leaving maintenance mode, pacemaker decided to
> shut down the XEN resources because the pingd attribute was not set.
Sounds like a plausible explanation.
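The pingd score is a transient attribute kept in the status section of the
CIB, so it does not survive a heartbeat restart; it only reappears once pingd
is running again and attrd has written it back. A quick way to see whether it
is there at all:

  # dump the status (transient) section and look for the pingd attribute
  cibadmin -Q -o status | grep pingd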
> 3. Why the pingd attribute was not set immediately after pingd started up and
> was able to ping the ping node. After pingd was started, it waited 60
> seconds (the timeout value) before setting the attribute, and only then were
> the XEN resources able to start, due to their location constraints.
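A guess, since the configuration is not quoted here: if those 60 seconds are
actually the pingd dampen setting rather than an operation timeout, the delay
is expected. dampen tells attrd how long to wait before writing the attribute,
to absorb short outages. Something like this would shorten it (ping-check
being a placeholder resource name):

  # hypothetical: reduce the dampening so the attribute is written sooner
  crm resource param ping-check set dampen 5s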
>
> 4. Maybe the answers to the other questions already answer this:
> why the cluster behaved so strangely at all with the large timeout values set
> in ha.cf.
>
> I could also send a cluster report in case it may help to figure out what was
> wrong here; I just did not want to send a large attachment to the list in
> the first place.
Probably best to open a bugzilla and attach the report there.
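hb_report usually does the job of collecting that; for example (the time and
destination here are just placeholders):

  # collect logs and CIB history from all nodes for the period of interest
  hb_report -f "2009/12/18 15:00" /tmp/pingd-report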
I guess that special care is necessary when setting resources to
unmanaged mode in case there are constraints which depend on
pingd attributes.
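In practice that means checking that the attribute has reappeared on all
nodes before handing control back to the cluster, roughly along these lines
(assuming your crm_mon supports -A for showing node attributes):

  # wait until every node shows its pingd attribute again ...
  crm_mon -1 -A
  # ... then take the cluster out of maintenance mode
  crm configure property maintenance-mode="false"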
Thanks,
Dejan
> regards,
> Sebastian
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker