[Pacemaker] Problems with Pacemaker + Corosync after reboot
Shravan Mishra
shravan.mishra at gmail.com
Fri Dec 24 22:47:53 UTC 2010
Hi,
Your configuration is straightforward, nothing out of the ordinary.
Make sure that when your other box comes back online, syslog-ng is
started before corosync. It appears that when you kill all the
processes and restart corosync by hand, syslog-ng has already started by
that time, so everything comes up properly.
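
A minimal sketch for checking the boot order. It assumes a Debian
sysvinit setup where services start in the sort order of the S-prefixed
links in /etc/rc2.d, and that the init scripts are named syslog-ng and
corosync. Both are assumptions, so adjust the directory and names for
your box. The function name is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: returns 0 if syslog-ng's start link sorts before
# corosync's in the given rc directory (default /etc/rc2.d), 1 otherwise.
# Assumes link names like S10syslog-ng and S20corosync.
syslog_before_corosync() {
    dir=${1:-/etc/rc2.d}
    # C collation so the listing order is not affected by the locale.
    first=$(LC_ALL=C ls "$dir" | grep -E '^S[0-9]+(syslog-ng|corosync)$' | head -n 1)
    case "$first" in
        *syslog-ng) return 0 ;;  # syslog-ng is started first
        *) return 1 ;;           # corosync first, or no syslog-ng link at all
    esac
}
```

Run it as "syslog_before_corosync /etc/rc2.d && echo ok". If it fails,
corosync is being started before syslog-ng at boot, which matches the
symptom described here.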
Your resource migrates back because nothing tells it to stay where it
is, i.e. there is no resource-stickiness.
You might want to look into how to set resource stickiness, which may mean
extending your config a little beyond what you have now. The configuration
manual explains it very nicely.
There is a tool called ptest that you can use to get the scores that
determine stickiness. For example, you can experiment with different
resource-stickiness values and then run
ptest -sL to look at the scores.
You will have to go a bit deeper than your vanilla config, and read the
manual, to understand this.
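
For example, with the crm shell that ships with Pacemaker 1.0, you could
set a cluster-wide default stickiness and then inspect the scores roughly
like this. The value 100 is only an illustrative starting point, not a
recommendation.

```sh
# Give every resource a default preference for staying on its current node.
crm configure rsc_defaults resource-stickiness=100
# Show allocation scores computed from the live CIB; look for failover-ip.
ptest -sL | grep failover-ip
```

Re-run ptest -sL after each change to see how the scores for atlantis
and daedalus shift.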
Thanks
-Shravan
On Thu, Dec 23, 2010 at 6:12 PM, Daniel Bareiro <daniel-listas at gmx.net>
wrote:
> On Wednesday, 22 December 2010 08:29:02 -0500,
> Shravan Mishra wrote:
>
>> Hi,
>
> Hi, Shravan.
>
>> What's happening is that corosync is forking but the exec is not
>> happening.
>
> And do you think that what is shown in the logs is consistent with what
> is shown using ps?
>
>> I used to see this problem when the syslog-ng process was not
>> running.
>>
>> Try checking that, start it if needed, and then start corosync.
>
> Now I see that if I shut down the node that holds the resource
> (failover-ip), it does not migrate to the other node. Before the
> test, I made sure Pacemaker + Corosync were working correctly on both
> nodes before shutting down Atlantis.
>
> Before shutting down Atlantis:
>
> -----------------------------------------------------------------------
> daedalus:~# crm_mon --one-shot
> ============
> Last updated: Thu Dec 23 19:24:09 2010
> Stack: openais
> Current DC: atlantis - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ atlantis daedalus ]
>
> failover-ip (ocf::heartbeat:IPaddr): Started atlantis
> -----------------------------------------------------------------------
>
> After shutting down Atlantis:
>
> -----------------------------------------------------------------------
> daedalus:~# crm_mon --one-shot
> ============
> Last updated: Thu Dec 23 19:25:44 2010
> Stack: openais
> Current DC: daedalus - partition WITHOUT quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ daedalus ]
> OFFLINE: [ atlantis ]
> -----------------------------------------------------------------------
>
> Here I'm using a configuration like the one presented in the wiki [1].
>
> I also notice that after Atlantis boots, corosync forks without
> exec'ing (as we assumed from what I showed in the previous mail), and
> only then does the resource migrate to Daedalus:
>
> -----------------------------------------------------------------------
> daedalus:~# crm_mon --one-shot
> ============
> Last updated: Thu Dec 23 19:49:11 2010
> Stack: openais
> Current DC: daedalus - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ daedalus ]
> OFFLINE: [ atlantis ]
>
> failover-ip (ocf::heartbeat:IPaddr): Started daedalus
> -----------------------------------------------------------------------
>
>
> -----------------------------------------------------------------------
> atlantis:~# crm_mon --one-shot
>
> Connection to cluster failed: connection failed
> -----------------------------------------------------------------------
>
> I tried running "corosync stop", but the processes do not exit:
>
> atlantis:~# ps auxf
> [...]
> root 1564 0.0 1.2 168144 3240 ? S 19:38 0:00 /usr/sbin/corosync
> root 1565 0.0 1.2 168144 3240 ? S 19:38 0:00 /usr/sbin/corosync
> root 1566 0.0 1.2 168144 3240 ? S 19:38 0:00 /usr/sbin/corosync
> root 1567 0.0 1.2 168144 3240 ? S 19:38 0:00 /usr/sbin/corosync
> root 1568 0.0 1.2 168144 3240 ? S 19:38 0:00 /usr/sbin/corosync
> root 1569 0.0 1.2 168144 3240 ? S 19:38 0:00 /usr/sbin/corosync
>
>
> The only way I found to start corosync correctly is to run "pkill -9
> corosync" followed by "corosync start":
>
>
> atlantis:~# ps auxf
> [...]
> root 2120 0.2 1.9 134288 5060 ? Ssl 19:59 0:00 /usr/sbin/corosync
> root 2128 0.0 4.5 76028 11600 ? SLs 19:59 0:00 \_ /usr/lib/heartbeat/stonithd
> 105 2129 0.1 2.0 79104 5120 ? S 19:59 0:00 \_ /usr/lib/heartbeat/cib
> root 2130 0.0 0.8 71580 2108 ? S 19:59 0:00 \_ /usr/lib/heartbeat/lrmd
> 105 2131 0.0 1.3 79968 3340 ? S 19:59 0:00 \_ /usr/lib/heartbeat/attrd
> 105 2132 0.0 1.1 80332 2892 ? S 19:59 0:00 \_ /usr/lib/heartbeat/pengine
> 105 2133 0.0 1.4 86216 3764 ? S 19:59 0:00 \_ /usr/lib/heartbeat/crmd
>
>
> After this, the resource automatically migrates back to Atlantis:
>
> -----------------------------------------------------------------------
> daedalus:~# crm_mon --one-shot
> ============
> Last updated: Thu Dec 23 20:03:18 2010
> Stack: openais
> Current DC: daedalus - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ atlantis daedalus ]
>
> failover-ip (ocf::heartbeat:IPaddr): Started atlantis
> -----------------------------------------------------------------------
>
>
> Any idea how to fix this problem with Corosync?
>
> Why does the resource not migrate to Daedalus when I shut down
> Atlantis?
>
>
>
> Thanks for your reply.
>
> Regards,
> Daniel
>
> [1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
> --
> Daniel Bareiro - GNU/Linux registered user #188.598
> Proudly running Debian GNU/Linux with uptime: