[Pacemaker] Problem with state: UNCLEAN (OFFLINE)

Fri Jun 8 13:14:26 UTC 2012

Hello,

Thanks a lot. That thread is relevant to my problem. I'll investigate
it further.

Regards,

On 08/06/12 13:51, Florian Crouzat wrote:
> On 08/06/2012 13:01, Juan M. Sierra wrote:
>> Problem with state: UNCLEAN (OFFLINE)
>>
>> Hello,
>>
>> I'm trying to set up an ldirectord service with Pacemaker.
>>
>> However, I ran into a problem with the UNCLEAN (offline) state. The
>> initial state of my cluster was:
>>
>>     Online: [ node2 node1 ]
>>
>>     node1-STONITH (stonith:external/ipmi): Started node2
>>     node2-STONITH (stonith:external/ipmi): Started node1
>>     Clone Set: Connected
>>     Started: [ node2 node1 ]
>>     Clone Set: ldirector-activo-activo
>>     Started: [ node2 node1 ]
>>     ftp-vip (ocf::heartbeat:IPaddr): Started node1
>>     web-vip (ocf::heartbeat:IPaddr): Started node2
>>
>>     Migration summary:
>>     * Node node1: pingd=2000
>>     * Node node2: pingd=2000
>>     node2-STONITH: migration-threshold=1000000 fail-count=1000000
>>
>>
>> Then I disconnected the power from node1, and the state became:
>>
>>     Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
>>     Online: [ node2 ]
>>
>>     node1-STONITH (stonith:external/ipmi): Started node2 FAILED
>>     Clone Set: Connected
>>     Started: [ node2 ]
>>     Stopped: [ ping:1 ]
>>     Clone Set: ldirector-activo-activo
>>     Started: [ node2 ]
>>     Stopped: [ ldirectord:1 ]
>>     web-vip (ocf::heartbeat:IPaddr): Started node2
>>
>>     Migration summary:
>>     * Node node2: pingd=2000
>>     node2-STONITH: migration-threshold=1000000 fail-count=1000000
>>     node1-STONITH: migration-threshold=1000000 fail-count=1000000
>>
>>     Failed actions:
>>     node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete): invalid parameter
>>     node1-STONITH_monitor_60000 (node=node2, call=11, rc=14, status=complete): status: unknown
>>     node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete): unknown error
>>
>>
>> I expected node2 to take over the ftp-vip resource, but that did not
>> happen. node1 stayed in the UNCLEAN state and node2 did not take over
>> its resources. Only after I restored power to node1 and it had
>> recovered did node2 take over the ftp-vip resource.
>>
>> I've seen some similar conversations here. Could you please give me
>> some ideas on this, or point me to a thread where it is discussed?
>>
>> Thanks a lot!
>>
>> Regards,
>>
>
> This has been discussed in the context of resource failover, but I
> guess the same applies here:
> http://oss.clusterlabs.org/pipermail/pacemaker/2012-May/014260.html
>
> The motto here (I discovered it a couple of days ago) is "better to
> have a hung cluster than a corrupted one, especially with shared
> filesystems/resources".
> So node1 failed, but node2 could not confirm its death, apparently
> because STONITH failed. By design, the cluster then hangs until it
> has a way to learn the real state of node1 (in this case, when node1
> came back up).
>
>
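
For anyone checking for the same situation, here is a minimal Python
sketch that scans status text like the output quoted above for nodes
marked UNCLEAN and for failed actions on STONITH resources. The sample
text is abridged from this thread. The function names and regular
expressions are my own assumptions based only on the format shown here,
not on any official description of the output.

```python
import re

# Sample modelled on the status output quoted in this thread (abridged,
# with the wrapped lines joined). Other Pacemaker versions may print a
# different layout; treat the patterns below as assumptions.
SAMPLE = """\
Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
Online: [ node2 ]

Failed actions:
node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete): invalid parameter
node1-STONITH_monitor_60000 (node=node2, call=11, rc=14, status=complete): status: unknown
node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete): unknown error
"""

# "Node <name> (<uuid>): UNCLEAN (offline)"
UNCLEAN_RE = re.compile(r"^\s*Node\s+(\S+).*:\s*UNCLEAN", re.MULTILINE)

# "<resource>_<operation>_<interval> (node=<node>, call=N, rc=N, ...)"
FAILED_RE = re.compile(
    r"^\s*(\S+)_(start|stop|monitor)_(\d+)\s+\(node=(\S+?),.*?rc=(\d+)",
    re.MULTILINE,
)


def unclean_nodes(text):
    """Return the names of nodes reported as UNCLEAN."""
    return UNCLEAN_RE.findall(text)


def failed_stonith_actions(text):
    """Return (resource, operation, node, rc) for failed STONITH actions.

    A resource counts as a STONITH resource here if its name contains
    "stonith"; that is a naming convention from this thread, not a rule.
    """
    failures = []
    for match in FAILED_RE.finditer(text):
        resource, operation, _interval, node, rc = match.groups()
        if "stonith" in resource.lower():
            failures.append((resource, operation, node, int(rc)))
    return failures


if __name__ == "__main__":
    for node in unclean_nodes(SAMPLE):
        print("UNCLEAN node:", node)
    for failure in failed_stonith_actions(SAMPLE):
        print("Failed STONITH action:", failure)
```

If both lists are non-empty, the cluster is probably in the state
described above: a node whose death could not be confirmed because
fencing itself is failing, so its resources are not recovered.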

-- 
Juan Manuel Sierra Prieto
Systems Administration
Centro Informatico Cientifico de Andalucia (CICA)
Avda. Reina Mercedes s/n - 41012 - Sevilla (Spain)
Tfno.: +34 955 056 600 / FAX: +34 955 056 650
Consejería de Economía, Innovación y Ciencia
Junta de Andalucía