[Pacemaker] Problem with state: UNCLEAN (OFFLINE)

Florian Crouzat gentoo at floriancrouzat.net
Fri Jun 8 11:51:29 UTC 2012

On 08/06/2012 13:01, Juan M. Sierra wrote:
> Hello,
> I'm trying to set up an ldirectord service with Pacemaker.
> But I ran into a problem with the UNCLEAN (offline) state. The initial
> state of my cluster was this:
>     Online: [ node2 node1 ]
>     node1-STONITH (stonith:external/ipmi): Started node2
>     node2-STONITH (stonith:external/ipmi): Started node1
>     Clone Set: Connected
>     Started: [ node2 node1 ]
>     Clone Set: ldirector-activo-activo
>     Started: [ node2 node1 ]
>     ftp-vip (ocf::heartbeat:IPaddr): Started node1
>     web-vip (ocf::heartbeat:IPaddr): Started node2
>     Migration summary:
>     * Node node1: pingd=2000
>     * Node node2: pingd=2000
>     node2-STONITH: migration-threshold=1000000 fail-count=1000000
> Then I unplugged the power from node1, and the state became:
>     Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
>     Online: [ node2 ]
>     node1-STONITH (stonith:external/ipmi): Started node2 FAILED
>     Clone Set: Connected
>     Started: [ node2 ]
>     Stopped: [ ping:1 ]
>     Clone Set: ldirector-activo-activo
>     Started: [ node2 ]
>     Stopped: [ ldirectord:1 ]
>     web-vip (ocf::heartbeat:IPaddr): Started node2
>     Migration summary:
>     * Node node2: pingd=2000
>     node2-STONITH: migration-threshold=1000000 fail-count=1000000
>     node1-STONITH: migration-threshold=1000000 fail-count=1000000
>     Failed actions:
>     node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete): invalid parameter
>     node1-STONITH_monitor_60000 (node=node2, call=11, rc=14, status=complete): status: unknown
>     node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete): unknown error
> I was hoping that node2 would take over the ftp-vip resource, but that
> is not what happened. node1 stayed in an unclean state and node2 did not
> take over its resources. Only when I restored power to node1 and it
> recovered did node2 take over the ftp-vip resource.
> I've seen some similar conversations here. Could you please give me
> some ideas on this subject, or point me to a thread where it is discussed?
> Thanks a lot!
> Regards,

It has been discussed for resource failover but I guess it's the same: 

The motto here (I discovered it a couple of days ago) is "better to have a
hung cluster than a corrupted one, especially with shared [...]".

So node1 failed, but node2 has not been able to confirm its death because
STONITH apparently failed. The design choice is then for the cluster to
hang until it has a way to learn the real state of node1 (at reboot, in
this case).
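
To make that reasoning concrete, here is a minimal toy model in Python. It is only a sketch of the rule described above, not Pacemaker's actual code: the names NodeState, Node, handle_lost_node and can_recover_resources are made up for illustration, and the fence callable merely stands in for a STONITH agent such as external/ipmi.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class NodeState(Enum):
    ONLINE = "online"
    UNCLEAN = "UNCLEAN (offline)"  # contact lost, death not confirmed
    OFFLINE = "OFFLINE"            # death confirmed by successful fencing


@dataclass
class Node:
    name: str
    state: NodeState = NodeState.ONLINE


def handle_lost_node(node: Node, fence: Callable[[str], bool]) -> NodeState:
    """Hypothetical handler called when a peer stops responding.

    `fence` is an assumed callable that returns True only when the
    node is confirmed to be powered off.
    """
    # Until fencing succeeds, the node may still be running resources.
    node.state = NodeState.UNCLEAN
    if fence(node.name):
        node.state = NodeState.OFFLINE
    return node.state


def can_recover_resources(node: Node) -> bool:
    """Resources may be taken over only once the node is known dead."""
    return node.state is NodeState.OFFLINE


if __name__ == "__main__":
    node1 = Node("node1")
    # Fencing fails, as in the report: the node stays UNCLEAN.
    handle_lost_node(node1, fence=lambda name: False)
    print(node1.state.value, can_recover_resources(node1))
```

In this model a failing fence leaves node1 UNCLEAN and blocks takeover of ftp-vip, which matches what was observed; only a confirmed fence (or the node coming back) unblocks recovery.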

Florian Crouzat
