[Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully
Parshvi
parshvi.17 at gmail.com
Thu Apr 19 11:11:32 UTC 2012
Major issues:
1) Corosync reaching over 100% cpu usage.
2) Corosync unable to stop gracefully.
3) The Virtual IP of a resource being assigned as the primary IP on an interface
after a cable disconnect/reconnect on that interface. The static IP on the
interface is then shown as a global secondary IP.
Use case:
1) Two nodes in a cluster.
2) Two communication paths exist between the two nodes, with “rrp_mode” set to
active in corosync.conf:
a. One path is a back-to-back connection between the nodes.
b. Second is via the LAN network switch.
3) The network cables were unplugged on one of the nodes (on both
interfaces) and reconnected after a short while.
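
To confirm which redundant ring mode a node is actually configured with, something along these lines may help. This is only a sketch: the helper name get_rrp_mode is made up for illustration, and it assumes the setting is written as a plain "rrp_mode: value" line in the file.

```shell
# Hypothetical helper: print the rrp_mode value found in a
# corosync.conf-style file given as $1 (first occurrence only).
get_rrp_mode() {
    # Match "rrp_mode: value" with optional leading whitespace;
    # lines starting with "#" do not match and are skipped.
    sed -n 's/^[[:space:]]*rrp_mode:[[:space:]]*\([a-z]*\).*$/\1/p' "$1" | head -n 1
}

# Example use on a node (the path is the usual default, adjust if needed):
#   get_rrp_mode /etc/corosync/corosync.conf
#   corosync-cfgtool -s    # prints the status of each configured ring
```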
Observations:
1) The Corosync service was taking 100% CPU on the node whose link was down:
a. In the above scenario the Corosync service could not be stopped gracefully. A
SIGKILL had to be issued to stop the service.
b. On this node, of the two interfaces configured in corosync.conf, one was
also the preferred interface (eth) for the Virtual IP.
i. It was observed that when the link came back up after the disconnection, the
primary global IP on that interface was the Virtual IP configured for a
resource.
ii. The static IP assigned to the interface was listed as “scope global
secondary” in the output of `ip addr show`.
iii. Also, the Virtual IPs of the resources configured in Pacemaker were
active on both nodes.
iv. `service network restart` also did not work.
c. The Corosync service was stopped (killed, since it could not be stopped
gracefully), the network service was restarted, and then Corosync was started
again. All was good after this.
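
For anyone trying to reproduce this, the check in (b.ii) and the workaround in (c) can be sketched roughly as below. The function names are made up for illustration, the `service corosync start` init script name is an assumption about the distribution, and the addresses in the usage comment are placeholders, not the ones from this cluster.

```shell
# Hypothetical helper: succeed if address $1 is flagged "secondary" in
# `ip addr show` output read from stdin.
addr_is_secondary() {
    # inet lines look like:
    #   inet 192.0.2.10/24 brd 192.0.2.255 scope global secondary eth0
    # -F matches the address literally; the trailing "/" avoids prefix matches.
    grep -F "inet $1/" | grep -qw 'secondary'
}

# Hypothetical wrapper for the workaround in (c). Not called automatically.
recover_corosync() {
    # A graceful stop did not work in this state, so SIGKILL is sent.
    pkill -KILL -x corosync
    service network restart
    # Assumption: the init script is named "corosync".
    service corosync start
}

# Example use (static IP and interface are placeholders):
#   ip addr show eth0 | addr_is_secondary 192.0.2.100 && recover_corosync
```

The check only reads text on stdin, so it can be tried against saved `ip addr show` output before running anything that touches the services.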