[Pacemaker] cluster deadlock/malfunctioning
Andrew Beekhof
beekhof at gmail.com
Tue Dec 2 14:22:07 UTC 2008
On Fri, Nov 14, 2008 at 16:25, Raoul Bhatia [IPAX] <r.bhatia at ipax.at> wrote:
> dear list,
>
> i am again encountering a cluster malfunction.
>
> i upgraded my configuration to use pam-ldap and libnss-ldap, which
> did not work out of the box.
>
> pacemaker tried to recover from some errors and then stonithed both
> hosts.
>
> i am now left with only:
>> Node: wc01 (31de4ab3-2d05-476e-8f9a-627ad6cd94ca): online
>> Node: wc02 (f36760d8-d84a-46b2-b452-4c8cac8b3396): online
>>
>> Clone Set: clone_nfs-common
>>     Resource Group: group_nfs-common:0
>>         nfs-common:0 (lsb:nfs-common): Started wc02
>>     Resource Group: group_nfs-common:1
>>         nfs-common:1 (lsb:nfs-common): Started wc01
>> Clone Set: DoFencing
>>     stonith_rackpdu:0 (stonith:external/rackpdu): Started wc02
>>     stonith_rackpdu:1 (stonith:external/rackpdu): Started wc01
>
> and pacemaker seems happy :)
>
> (please note that i normally have several groups, clones and
> master/slave resources active.)
>
> i took a look at pe-warn-12143.bz2 but do not know how to interpret the
> three different threads i see in the corresponding .dot file.
>
> can anyone explain how i might debug such a deadlock?
What exactly were you trying to determine here?
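In case the question is just how to replay one of those files yourself,
this is roughly how I do it (the file name is taken from your mail, and
the exact flags can vary a little between versions):

    # replay the saved PE input, show scores and write out the transition graph
    ptest -VVV -s -x pe-warn-12143.bz2 -D pe-warn-12143.dot
    # render the graph with graphviz
    dot -Tpng pe-warn-12143.dot -o pe-warn-12143.png

Roughly speaking, each disconnected "thread" in the .dot file is an
independent chain of ordered actions, and actions with a red border are
ones the cluster wanted to run but could not.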
Running through ptest, I see two major areas of concern:
Clones clone_nfs-common contains non-OCF resource nfs-common:0 and so
can only be used as an anonymous clone. Set the globally-unique meta
attribute to false
Clones clone_mysql-proxy contains non-OCF resource mysql-proxy:0 and
so can only be used as an anonymous clone. Set the globally-unique
meta attribute to false
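In the CIB that looks something like the sketch below (the
meta_attributes/nvpair ids are made up; the same change applies to
clone_mysql-proxy):

    <clone id="clone_nfs-common">
      <meta_attributes id="clone_nfs-common-meta">
        <nvpair id="clone_nfs-common-globally-unique" name="globally-unique" value="false"/>
      </meta_attributes>
      <group id="group_nfs-common">
        <!-- existing nfs-common primitive stays as it is -->
      </group>
    </clone>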
And the second area of concern:
Hard error - drbd_www:0_monitor_0 failed with rc=4: Preventing
drbd_www:0 from re-starting on wc01
Hard error - drbd_www:1_monitor_0 failed with rc=4: Preventing
drbd_www:1 from re-starting on wc01
Hard error - drbd_mysql:1_monitor_0 failed with rc=4: Preventing
drbd_mysql:1 from re-starting on wc01
Hard error - drbd_mysql:0_monitor_0 failed with rc=4: Preventing
drbd_mysql:0 from re-starting on wc01
If drbd is failing, then I can imagine that would prevent much of the
rest of the cluster from being started.
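For what it is worth, rc=4 is OCF_ERR_PERM, which the PE treats as a
hard error, hence the refusal to even try restarting on wc01. Assuming
you are using the stock ocf:heartbeat:drbd agent, you can run its
monitor action by hand to see what it is unhappy about, along these
lines (the drbd resource name "www" is a guess based on your resource
names, and some agents want a few more OCF_* variables set):

    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_drbd_resource=www   # whatever drbd_www actually manages
    /usr/lib/ocf/resource.d/heartbeat/drbd monitor; echo rc=$?

Given the pam-ldap/libnss-ldap change you mention, a permission-related
failure would not be a huge surprise.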
Also, you might want to look into:
Operation nfs-kernel-server_monitor_0 found resource nfs-kernel-server
active on wc01
Operation nfs-common:0_monitor_0 found resource nfs-common:0 active on wc01
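That normally means the init scripts started those services at boot,
outside the cluster's control. If you want the cluster to be the only
thing starting them, something like this on Debian keeps the boot-time
links out of the way (a sketch, service names taken from your config):

    update-rc.d -f nfs-kernel-server remove
    update-rc.d -f nfs-common remove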
Having said all that, I just looked at the config, and all of the above
is more than likely caused by the issue we spoke about the other day:
loading 0.6 config fragments into a 1.0 cluster (where all the meta
attribute names now use dashes instead of underscores).
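Concretely, an nvpair that a 0.6 fragment spells as (ids elided):

    <nvpair id="..." name="target_role" value="Started"/>

needs to become, for 1.0:

    <nvpair id="..." name="target-role" value="Started"/>

and likewise globally_unique -> globally-unique, is_managed ->
is-managed, and so on. The underscore spellings are simply not
recognised by 1.0, so whatever you set there never takes effect, which
would account for most of the warnings above.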