[Pacemaker] Problem with state: UNCLEAN (OFFLINE)
Juan M. Sierra
jmsierra at cica.es
Fri Jun 8 13:14:26 UTC 2012
Hello,
Thank you very much. That thread is relevant to my problem; I'll
look into it further.
Regards,
On 08/06/12 13:51, Florian Crouzat wrote:
> On 08/06/2012 13:01, Juan M. Sierra wrote:
>> Problem with state: UNCLEAN (OFFLINE)
>>
>> Hello,
>>
>> I'm trying to bring up an ldirectord service with Pacemaker.
>>
>> But I've run into a problem with the UNCLEAN (offline) state. The
>> initial state of my cluster was this:
>>
>> Online: [ node2 node1 ]
>>
>> node1-STONITH (stonith:external/ipmi): Started node2
>> node2-STONITH (stonith:external/ipmi): Started node1
>> Clone Set: Connected
>> Started: [ node2 node1 ]
>> Clone Set: ldirector-activo-activo
>> Started: [ node2 node1 ]
>> ftp-vip (ocf::heartbeat:IPaddr): Started node1
>> web-vip (ocf::heartbeat:IPaddr): Started node2
>>
>> Migration summary:
>> * Node node1: pingd=2000
>> * Node node2: pingd=2000
>> node2-STONITH: migration-threshold=1000000 fail-count=1000000
>>
>> Then I disconnected node1's power, and the state was the following:
>>
>> Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
>> Online: [ node2 ]
>>
>> node1-STONITH (stonith:external/ipmi): Started node2 FAILED
>> Clone Set: Connected
>> Started: [ node2 ]
>> Stopped: [ ping:1 ]
>> Clone Set: ldirector-activo-activo
>> Started: [ node2 ]
>> Stopped: [ ldirectord:1 ]
>> web-vip (ocf::heartbeat:IPaddr): Started node2
>>
>> Migration summary:
>> * Node node2: pingd=2000
>> node2-STONITH: migration-threshold=1000000 fail-count=1000000
>> node1-STONITH: migration-threshold=1000000 fail-count=1000000
>>
>> Failed actions:
>> node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete):
>> invalid parameter
>> node1-STONITH_monitor_60000 (node=node2, call=11, rc=14,
>> status=complete): status: unknown
>> node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete):
>> unknown error
>>
>> I was expecting node2 to take over the ftp-vip resource, but that
>> didn't happen. node1 remained in the UNCLEAN state and node2 did not
>> take over its resources. Only when I reconnected node1's power and it
>> had recovered did node2 take over the ftp-vip resource.
>>
>> I've seen some similar conversations here. Could you please give me
>> some pointers on this subject, or a thread where it is discussed?
>>
>> Thanks a lot!
>>
>> Regards,
>>
>
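A note on the failed actions above: the "invalid parameter" on
node2-STONITH_start_0, and the fail-count of 1000000 that already
appears in the first status output, suggest the STONITH primitives were
misconfigured before node1 lost power. A rough way to check this, with
placeholder names and values rather than anything from the real
configuration, is to compare the plugin's expected parameters with the
primitive definition:

    # list the parameters the external/ipmi plugin understands
    stonith -t external/ipmi -n

    # show what is actually configured for one of the fencing primitives
    crm configure show node1-STONITH

    # a typical external/ipmi primitive (all values here are placeholders)
    primitive node1-STONITH stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.100.1 userid=admin \
               passwd=secret interface=lan \
        op monitor interval=60s

Once the definition is corrected, the fail-count can be cleared with
"crm resource cleanup node1-STONITH".
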
> It has been discussed in the context of resource failover, but I guess
> it's the same issue:
> http://oss.clusterlabs.org/pipermail/pacemaker/2012-May/014260.html
>
> The motto here (I discovered it a couple of days ago) is "better a hung
> cluster than a corrupted one", especially with shared
> filesystems/resources.
> So node1 failed, but node2 was not able to confirm its death because
> STONITH apparently failed; the design choice is therefore for the
> cluster to hang until it has a way to learn node1's real state (at
> reboot, in this case).
>
>
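One caveat worth adding here (an assumption, since the hardware setup
is not shown in the thread): external/ipmi can only confirm the kill if
node1's BMC is still reachable, so if unplugging node1 also cut power
to its management controller, node2 had no way to verify that the node
was really down, which matches the failed node1-STONITH monitor/start
actions above. The fencing path can be checked by hand from node2
(address and credentials below are placeholders):

    # can node2 reach node1's IPMI interface at all?
    ipmitool -I lanplus -H 192.168.100.1 -U admin -P secret chassis power status

    # after fixing the STONITH definitions, clear old failures and re-check
    crm resource cleanup node1-STONITH
    crm resource cleanup node2-STONITH
    crm_mon -1

If the BMC really does lose power together with the node, a second
fencing device (for example a switched PDU) is usually needed to cover
this particular failure scenario.
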
--
Juan Manuel Sierra Prieto
Systems Administration
Centro Informatico Cientifico de Andalucia (CICA)
Avda. Reina Mercedes s/n - 41012 - Sevilla (Spain)
Phone: +34 955 056 600 / Fax: +34 955 056 650
Consejería de Economía, Innovación y Ciencia
Junta de Andalucía