[Pacemaker] Problem with state: UNCLEAN (OFFLINE)

Florian Haas florian at hastexo.com
Fri Jun 8 07:45:56 EDT 2012


On Fri, Jun 8, 2012 at 1:01 PM, Juan M. Sierra <jmsierra at cica.es> wrote:
> Problem with state: UNCLEAN (OFFLINE)
>
> Hello,
>
> I'm trying to set up an ldirectord service with Pacemaker.
>
> But I've run into a problem with the UNCLEAN (offline) state. The initial
> state of my cluster was this:
>
> Online: [ node2 node1 ]
>
> node1-STONITH    (stonith:external/ipmi):        Started node2
> node2-STONITH    (stonith:external/ipmi):        Started node1
>  Clone Set: Connected
>      Started: [ node2 node1 ]
>  Clone Set: ldirector-activo-activo
>      Started: [ node2 node1 ]
> ftp-vip (ocf::heartbeat:IPaddr):        Started node1
> web-vip (ocf::heartbeat:IPaddr):        Started node2
>
> Migration summary:
> * Node node1:  pingd=2000
> * Node node2:  pingd=2000
>    node2-STONITH: migration-threshold=1000000 fail-count=1000000
>
> Then I cut the power to node1, and the cluster state became the following:
>
> Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
> Online: [ node2 ]
>
> node1-STONITH    (stonith:external/ipmi):        Started node2 FAILED
>  Clone Set: Connected
>      Started: [ node2 ]
>      Stopped: [ ping:1 ]
>  Clone Set: ldirector-activo-activo
>      Started: [ node2 ]
>      Stopped: [ ldirectord:1 ]
> web-vip (ocf::heartbeat:IPaddr):        Started node2
>
> Migration summary:
> * Node node2:  pingd=2000
>    node2-STONITH: migration-threshold=1000000 fail-count=1000000
>    node1-STONITH: migration-threshold=1000000 fail-count=1000000
>
> Failed actions:
>     node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete):
> invalid parameter
>     node1-STONITH_monitor_60000 (node=node2, call=11, rc=14,
> status=complete): status: unknown
>     node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete):
> unknown error
>
> I was hoping that node2 would take over the ftp-vip resource, but that
> didn't happen. node1 remained in an unclean state and node2 didn't take
> over its resources. Only when I restored power to node1 and it had
> recovered did node2 take over the ftp-vip resource.
>
> I've seen some similar conversations here. Could you give me some pointers
> on this subject, or to a thread where it is discussed?

Well, your healthy node failed to fence the offending node. So fix your
STONITH device configuration, and as soon as the healthy node is able to
fence, your failover should work fine.
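
For reference, a minimal sketch of what such a configuration could look
like in the crm shell, assuming hypothetical BMC addresses
(192.168.122.101/102) and credentials (admin/secret); the location
constraints keep each fencing device off the node it is supposed to kill:

  # One external/ipmi device per node; all params below are placeholders.
  primitive node1-STONITH stonith:external/ipmi \
          params hostname=node1 ipaddr=192.168.122.101 \
                 userid=admin passwd=secret interface=lanplus \
          op monitor interval=60s
  primitive node2-STONITH stonith:external/ipmi \
          params hostname=node2 ipaddr=192.168.122.102 \
                 userid=admin passwd=secret interface=lanplus \
          op monitor interval=60s
  # Never run a fencing device on the node it is meant to fence.
  location l-node1-STONITH node1-STONITH -inf: node1
  location l-node2-STONITH node2-STONITH -inf: node2

You can also exercise the plugin outside the cluster with the stonith(8)
utility from cluster-glue (same placeholder parameters), which helps rule
out configuration errors like the "invalid parameter" in your failed
actions:

  stonith -t external/ipmi hostname=node1 ipaddr=192.168.122.101 \
          userid=admin passwd=secret interface=lanplus -T reset node1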

Of course, if your IPMI BMC fails immediately after you remove power
from the machine (i.e. it has no backup battery that would let it at
least report the power status), then you might have to fix your issue by
switching to a different STONITH device altogether.
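
You can check whether the BMC stays reachable with host power removed by
querying it directly with ipmitool (again a placeholder address and
credentials) while the machine is unplugged:

  # Ask node1's BMC for the chassis power status over the LAN interface
  ipmitool -I lanplus -H 192.168.122.101 -U admin -P secret \
          chassis power status

If that stops answering the moment you pull the cords, the BMC has no
independent power and external/ipmi cannot fence for that failure mode.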

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



