[Pacemaker] Problem with state: UNCLEAN (OFFLINE)

Fri Jun 8 09:56:17 EDT 2012

I've seen the cable coming from the redundant PSU's backplane short 
against the chassis. I've seen the RJ45 connector for IPMI/iLO go bad.
Of course, a switch port could go bad or a network cable could come out. 
There are many ways that IPMI/iLO could fail independent of the incoming 
power.

For this reason, if you can test the latest version of Pacemaker (I
think you are not in production yet), set up a switched PDU as a backup
fence device. This is what I always do in RHCS. This way, if the IPMI
does fail for whatever reason, the cluster can reach out to the PDU(s)
and cut the power feeding both of the node's power supplies.

For example (using pseudo cluster.conf terms):

Node1
   fence method1
     IPMI_node1
   fence method2
     pdu1 - outlet 1
     pdu2 - outlet 1
Node2
   fence method1
     IPMI_node2
   fence method2
     pdu1 - outlet 2
     pdu2 - outlet 2

Note that both PDUs will have to return success for the method itself to 
be considered a success.
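
To make the ordering concrete, here is a minimal Python sketch of the
fallback logic described above. It is not Pacemaker or RHCS code; the
names fence_node, ipmi_node1, pdu1_outlet1 and pdu2_outlet1 are made up
for illustration, and each fence device is modelled as a plain function
that returns True on success. It assumes methods are tried in order and
that a method counts as successful only if every device in it succeeds.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model: a fence device is a callable returning True on success.
FenceDevice = Callable[[], bool]
FenceMethod = Tuple[str, List[FenceDevice]]


def fence_node(methods: List[FenceMethod]) -> Optional[str]:
    """Try each method in order; return the name of the first one in which
    every device reports success, or None if all methods fail."""
    for name, devices in methods:
        # all() stops at the first failing device in this method.
        if all(device() for device in devices):
            return name
    return None


if __name__ == "__main__":
    # Simulated devices: the node has lost power, so its IPMI BMC is
    # unreachable, but both PDU outlets can still be switched off.
    def ipmi_node1() -> bool:
        return False

    def pdu1_outlet1() -> bool:
        return True

    def pdu2_outlet1() -> bool:
        return True

    node1_methods = [
        ("method1", [ipmi_node1]),
        ("method2", [pdu1_outlet1, pdu2_outlet1]),
    ]
    print(fence_node(node1_methods))  # prints "method2"
```

If either PDU in method2 returned False here, fence_node would return
None, which corresponds to the node staying unfenced.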

On 06/08/2012 09:11 AM, Juan M. Sierra wrote:
> Hello,
>
> First of all, thank you very much for your quick reply.
>
> Your advice has made me think about the power problem and its
> relation to STONITH. In my case, I use two machines with an iLO-like
> system (like HP servers) and two power supplies.
>
> Really, it would be very unusual for the two power supplies to fail
> together. The other case would be the motherboard getting seriously
> damaged.
>
> In any case, I understand I'll need a third element (independent of both
> machines) to ensure that STONITH works properly. Maybe something like a
> UPS or an advanced power supply line.
>
> I'll try to investigate this a little more. Again, thank you very much
> for your help.
>
> Cheers,
>
> On 08/06/12 13:45, Florian Haas wrote:
>> On Fri, Jun 8, 2012 at 1:01 PM, Juan M. Sierra <jmsierra at cica.es> wrote:
>>> Problem with state: UNCLEAN (OFFLINE)
>>>
>>> Hello,
>>>
>>> I'm trying to set up an ldirectord service with Pacemaker.
>>>
>>> But I ran into a problem with the UNCLEAN (offline) state. The initial
>>> state of my cluster was this:
>>>
>>> Online: [ node2 node1 ]
>>>
>>> node1-STONITH (stonith:external/ipmi): Started node2
>>> node2-STONITH (stonith:external/ipmi): Started node1
>>> Clone Set: Connected
>>> Started: [ node2 node1 ]
>>> Clone Set: ldirector-activo-activo
>>> Started: [ node2 node1 ]
>>> ftp-vip (ocf::heartbeat:IPaddr): Started node1
>>> web-vip (ocf::heartbeat:IPaddr): Started node2
>>>
>>> Migration summary:
>>> * Node node1: pingd=2000
>>> * Node node2: pingd=2000
>>> node2-STONITH: migration-threshold=1000000 fail-count=1000000
>>>
>>> Then I disconnected the power from node1, and the state became:
>>>
>>> Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
>>> Online: [ node2 ]
>>>
>>> node1-STONITH (stonith:external/ipmi): Started node2 FAILED
>>> Clone Set: Connected
>>> Started: [ node2 ]
>>> Stopped: [ ping:1 ]
>>> Clone Set: ldirector-activo-activo
>>> Started: [ node2 ]
>>> Stopped: [ ldirectord:1 ]
>>> web-vip (ocf::heartbeat:IPaddr): Started node2
>>>
>>> Migration summary:
>>> * Node node2: pingd=2000
>>> node2-STONITH: migration-threshold=1000000 fail-count=1000000
>>> node1-STONITH: migration-threshold=1000000 fail-count=1000000
>>>
>>> Failed actions:
>>> node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete): invalid parameter
>>> node1-STONITH_monitor_60000 (node=node2, call=11, rc=14, status=complete): status: unknown
>>> node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete): unknown error
>>>
>>> I was hoping that node2 would take over the ftp-vip resource, but it
>>> didn't. node1 stayed in the UNCLEAN state and node2 didn't take over
>>> its resources. Only when I reconnected the power to node1 and it
>>> recovered did node2 take over the ftp-vip resource.
>>>
>>> I've seen some similar conversations here. Could you please give me
>>> some ideas on this subject, or point me to a thread where it is
>>> discussed?
>> Well, your healthy node failed to fence your offending node. So fix
>> your STONITH device configuration and as soon as that is able to
>> fence, your failover should work fine.
>>
>> Of course, if your IPMI BMC fails immediately after you remove power
>> from the machine (i.e. it has no backup battery so it can at least
>> report the power status), then you might have to fix your issue by
>> switching to a different STONITH device altogether.
>>
>> Cheers,
>> Florian
>>
>

-- 
Digimer
Papers and Projects: https://alteeve.com