[Pacemaker] Problem with state: UNCLEAN (OFFLINE)
Florian Crouzat
gentoo at floriancrouzat.net
Fri Jun 8 11:51:29 UTC 2012
On 08/06/2012 13:01, Juan M. Sierra wrote:
> Problem with state: UNCLEAN (OFFLINE)
>
> Hello,
>
> I'm trying to bring up an ldirectord service with Pacemaker.
>
> But I ran into a problem with the UNCLEAN (offline) state. The initial
> state of my cluster was this:
>
> Online: [ node2 node1 ]
>
> node1-STONITH (stonith:external/ipmi): Started node2
> node2-STONITH (stonith:external/ipmi): Started node1
> Clone Set: Connected
> Started: [ node2 node1 ]
> Clone Set: ldirector-activo-activo
> Started: [ node2 node1 ]
> ftp-vip (ocf::heartbeat:IPaddr): Started node1
> web-vip (ocf::heartbeat:IPaddr): Started node2
>
> Migration summary:
> * Node node1: pingd=2000
> * Node node2: pingd=2000
> node2-STONITH: migration-threshold=1000000 fail-count=1000000
>
> Then I cut the power to node1, and the state became the following:
>
> Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
> Online: [ node2 ]
>
> node1-STONITH (stonith:external/ipmi): Started node2 FAILED
> Clone Set: Connected
> Started: [ node2 ]
> Stopped: [ ping:1 ]
> Clone Set: ldirector-activo-activo
> Started: [ node2 ]
> Stopped: [ ldirectord:1 ]
> web-vip (ocf::heartbeat:IPaddr): Started node2
>
> Migration summary:
> * Node node2: pingd=2000
> node2-STONITH: migration-threshold=1000000 fail-count=1000000
> node1-STONITH: migration-threshold=1000000 fail-count=1000000
>
> Failed actions:
> node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete):
> invalid parameter
> node1-STONITH_monitor_60000 (node=node2, call=11, rc=14,
> status=complete): status: unknown
> node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete):
> unknown error
>
> I was hoping that node2 would take over the ftp-vip resource, but that
> is not what happened. node1 stayed in an unclean state and node2 did
> not take over its resources. Only when I restored power to node1 and
> it had recovered did node2 take over the ftp-vip resource.
>
> I've seen some similar conversations here. Could you please point me
> to some ideas on this subject, or to a thread where it is discussed?
>
> Thanks a lot!
>
> Regards,
>
This has been discussed for resource failover, but I guess the same
reasoning applies here:
http://oss.clusterlabs.org/pipermail/pacemaker/2012-May/014260.html
The motto here (I discovered it a couple of days ago) is: better to
have a hung cluster than a corrupted one, especially with shared
filesystems/resources.
So, node1 failed, but node2 was never able to confirm its death because
STONITH apparently failed. The design choice is therefore for the
cluster to hang until it has a way to learn node1's real state (here,
when node1 came back up after power was restored).
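
For what it's worth, a rough sketch of what I would check in that
situation (the BMC address, user and password below are placeholders,
not values from your configuration):

  # 1. The "invalid parameter" on node2-STONITH_start_0 suggests the
  #    external/ipmi resource is misconfigured; list the parameters
  #    the plugin expects and compare them with your primitive:
  stonith -t external/ipmi -n

  # 2. Check that the IPMI path to node1's BMC works at all from
  #    node2, outside of the cluster:
  ipmitool -I lanplus -H 10.0.0.1 -U ipmiuser -P ipmipass chassis power status

  # 3. If you are absolutely certain node1 is powered off, you can
  #    tell the cluster so by hand, and it should then fail the
  #    resources over (only do this when you are sure, for the
  #    reasons above):
  stonith_admin --confirm node1

Until fencing either succeeds or is confirmed manually, the cluster
will keep node1's resources blocked on purpose.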
--
Cheers,
Florian Crouzat