[Pacemaker] pacemaker and spanning tree in the network between the nodes
Sebastian Reitenbach
sebastia at l00-bugdead-prods.de
Fri Dec 18 14:44:11 UTC 2009
Hi,
I have a 4 node cluster managing some XEN resources. The XEN resources have
location constraints defined, based on pingd. On each node, a pingd clone
instance is running. The XEN resources are only started when pingd is able to
reach the ping node. The XEN resources also have a preferred and a fallback
location defined. The pingd resources have a timeout of 60 seconds defined.
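In essence the relevant part of the configuration looks roughly like this
(a simplified sketch; resource and node names as well as the scores are just
placeholders, and I believe the 60 seconds mentioned above is the dampen
value, though it could also be the monitor timeout):

  # connectivity check, cloned so it runs on every node
  primitive pingd ocf:pacemaker:pingd \
          params host_list="<ip-of-ping-node>" multiplier="100" dampen="60s" \
          op monitor interval="10s" timeout="60s"
  clone pingd-clone pingd
  # one of the XEN guests
  primitive xen-vm1 ocf:heartbeat:Xen \
          params xmfile="/etc/xen/vm/vm1"
  # never run the guest on a node that cannot reach the ping node
  location xen-vm1-connectivity xen-vm1 \
          rule -inf: not_defined pingd or pingd lte 0
  # preferred and fallback locations
  location xen-vm1-preferred xen-vm1 100: node1
  location xen-vm1-fallback xen-vm1 50: node2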
The cluster nodes run SLES11, x86_64, with the following RPMs installed:
heartbeat-3.0.0-33.2
pacemaker-1.0.5-4.1
libpacemaker3-1.0.5-4.1
pacemaker-mgmt-client-1.99.2-7.1
pacemaker-mgmt-1.99.2-7.1
openais-0.80.3-26.1
libopenais2-0.80.3-26.1
I want to switch to a redundant network layout using spanning tree between
the switches. In case of a spanning tree recalculation, caused by a path
failure or any other reason, I don't want nodes to be declared dead just
because they temporarily cannot exchange heartbeat packets.
Therefore I tried to prepare pacemaker on the cluster nodes.
I put the whole cluster in maintenance mode via the hb_gui.
Then I reconfigured /etc/ha.d/ha.cf and defined deadtime 70 and initdead 100.
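i.e. the changed part of ha.cf now read:

  deadtime 70
  initdead 100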
Then I restarted heartbeat on each cluster node. I waited until all cluster
members were marked green/online in the GUI again. Then I turned off the
maintenance mode.
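(As far as I understand, toggling maintenance mode in the hb_gui should be
equivalent to setting the cluster property on the command line, i.e. something
like:

  crm configure property maintenance-mode="true"
  crm configure property maintenance-mode="false"

so the same behaviour should be reproducible without the GUI.)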
All XEN resources were shut down immediately. In addition, the pingd
resources looked a bit "strange" in the hb_gui: after leaving maintenance
mode, only one pingd resource showed the description ocf:pacemaker:pingd
under Management. They were green and shown as running on ['<server>'].
Then I tried to restart the XEN resources manually, but the cluster only tried
to start them on one host, not on the preferred or fallback location.
Then I shut down heartbeat on all 4 cluster nodes again, put back the old
ha.cf file with deadtime 15 and initdead 40, and restarted heartbeat.
After the cluster was running again, the pingd resources were started as
well. Then, after the 60 seconds, the pingd attribute was set and the XEN
resources were started on all hosts.
I wonder about some things:
1. Why did three of the pingd resources show no description after leaving
maintenance mode?
2. Why were all XEN resources shut down after leaving maintenance mode?
Here I have a theory: in maintenance mode the pingd attribute did not get
updated, and because heartbeat was restarted on each node, the attribute was
not set at all. Therefore, when leaving maintenance mode, pacemaker decided
to shut down the XEN resources because the pingd attribute was missing (the
query shown below this list is how I would verify that).
3. Why was the pingd attribute not set immediately after pingd started up
and was able to ping the ping node? After pingd started, it waited the full
60 seconds (the timeout value) before setting the attribute, and only then
were the XEN resources able to start, due to their location constraint.
4. Maybe the answers to the other questions already cover this: why did the
cluster behave so strangely at all with the larger timeout values set in
ha.cf?
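To verify the theory in 2., I assume the transient pingd attribute can be
inspected directly in the status section of the CIB, with something like:

  # the pingd node attribute should show up in the status section
  # once attrd has written it
  cibadmin -Q -o status | grep pingd

If that shows nothing while the cluster is in maintenance mode, that would
support the theory.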
I could also send a cluster-report in case it helps to figure out what went
wrong here; I just did not want to send a large attachment to the list in
the first place.
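If anyone wants it, I would generate the report with hb_report, roughly like
this (destination path is just an example):

  hb_report -f "<time shortly before the heartbeat restart>" /tmp/pcmk-report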
regards,
Sebastian