[Pacemaker] Problem with state: UNCLEAN (OFFLINE)
Florian Crouzat
gentoo at floriancrouzat.net
Fri Jun 8 11:51:29 UTC 2012
On 08/06/2012 13:01, Juan M. Sierra wrote:
> Problem with state: UNCLEAN (OFFLINE)
>
> Hello,
>
> I'm trying to bring up an ldirectord service with Pacemaker.
>
> But I ran into a problem with the UNCLEAN (offline) state. The initial
> state of my cluster was this:
>
> Online: [ node2 node1 ]
>
> node1-STONITH (stonith:external/ipmi): Started node2
> node2-STONITH (stonith:external/ipmi): Started node1
> Clone Set: Connected
> Started: [ node2 node1 ]
> Clone Set: ldirector-activo-activo
> Started: [ node2 node1 ]
> ftp-vip (ocf::heartbeat:IPaddr): Started node1
> web-vip (ocf::heartbeat:IPaddr): Started node2
>
> Migration summary:
> * Node node1: pingd=2000
> * Node node2: pingd=2000
> node2-STONITH: migration-threshold=1000000 fail-count=1000000
>
> Then I cut the power to node1, and the state became the following:
>
> Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
> Online: [ node2 ]
>
> node1-STONITH (stonith:external/ipmi): Started node2 FAILED
> Clone Set: Connected
> Started: [ node2 ]
> Stopped: [ ping:1 ]
> Clone Set: ldirector-activo-activo
> Started: [ node2 ]
> Stopped: [ ldirectord:1 ]
> web-vip (ocf::heartbeat:IPaddr): Started node2
>
> Migration summary:
> * Node node2: pingd=2000
> node2-STONITH: migration-threshold=1000000 fail-count=1000000
> node1-STONITH: migration-threshold=1000000 fail-count=1000000
>
> Failed actions:
> node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete):
> invalid parameter
> node1-STONITH_monitor_60000 (node=node2, call=11, rc=14,
> status=complete): status: unknown
> node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete):
> unknown error
>
> I was hoping that node2 would take over the ftp-vip resource, but that
> is not what happened. node1 stayed in an unclean state and node2 did
> not take over its resources. Only when I restored power to node1 and
> it had recovered did node2 take over the ftp-vip resource.
>
> I've seen some similar conversations here. Could you please point me
> to some ideas on this subject, or to a thread where it is discussed?
>
> Thanks a lot!
>
> Regards,
>
This has been discussed for resource failover, but I guess the same
reasoning applies here:
http://oss.clusterlabs.org/pipermail/pacemaker/2012-May/014260.html
The motto here (I discovered it a couple of days ago) is: better to
have a hung cluster than a corrupted one, especially with shared
filesystems/resources.
So, node1 failed, but node2 was never able to confirm its death because
STONITH apparently failed. The design choice is therefore for the
cluster to hang until it has a way to learn node1's real state (here,
when node1 came back up after power was restored).
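
For what it's worth, a rough sketch of what I would check in that
situation (the BMC address, user and password below are placeholders,
not values from your configuration):

  # 1. The "invalid parameter" on node2-STONITH_start_0 suggests the
  #    external/ipmi resource is misconfigured; list the parameters
  #    the plugin expects and compare them with your primitive:
  stonith -t external/ipmi -n

  # 2. Check that the IPMI path to node1's BMC works at all from
  #    node2, outside of the cluster:
  ipmitool -I lanplus -H 10.0.0.1 -U ipmiuser -P ipmipass chassis power status

  # 3. If you are absolutely certain node1 is powered off, you can
  #    tell the cluster so by hand, and it should then fail the
  #    resources over (only do this when you are sure, for the
  #    reasons above):
  stonith_admin --confirm node1

Until fencing either succeeds or is confirmed manually, the cluster
will keep node1's resources blocked on purpose.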
--
Cheers,
Florian Crouzat