[Pacemaker] Failover when storage fails
Max Williams
Max.Williams at betfair.com
Wed May 11 12:55:06 UTC 2011
Hi,
I want to configure Pacemaker to fail over a group of resources together with sg_persist (master/slave) when there is a problem with the storage. However, when I make the iSCSI LUN disappear to simulate a failure, the cluster always gets stuck in this state:
Last updated: Wed May 11 10:52:43 2011
Stack: openais
Current DC: host001.domain - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ host002.domain host001.domain ]
fence_host002.domain (stonith:fence_ipmilan): Started host001.domain
fence_host001.domain (stonith:fence_ipmilan): Started host001.domain
Resource Group: MyApp_group
MyApp_lvm_graph (ocf::heartbeat:LVM): Started host002.domain FAILED
MyApp_lvm_landing (ocf::heartbeat:LVM): Started host002.domain FAILED
MyApp_fs_graph (ocf::heartbeat:Filesystem): Started host002.domain
MyApp_fs_landing (ocf::heartbeat:Filesystem): Started host002.domain
MyApp_IP (ocf::heartbeat:IPaddr): Stopped
MyApp_init_script (lsb:abworkload): Stopped
Master/Slave Set: ms_MyApp_scsi_reservation
Masters: [ host002.domain ]
Slaves: [ host001.domain ]
Failed actions:
MyApp_lvm_graph_monitor_10000 (node=host002.domain, call=129, rc=-2, status=Timed Out): unknown exec error
MyApp_lvm_landing_monitor_10000 (node=host002.domain, call=130, rc=-2, status=Timed Out): unknown exec error
These messages are printed over and over in the logs:
May 11 12:34:56 host002 lrmd: [2561]: info: perform_op:2884: operation stop[202] on ocf::Filesystem::MyApp_fs_graph for client 31850, its parameters: fstype=[ext4] crm_feature_set=[3.0.2] device=[/dev/VolGroupB00/abb_graph] CRM_meta_timeout=[20000] directory=[/naab1] for rsc is already running.
May 11 12:34:56 host002 lrmd: [2561]: info: perform_op:2894: postponing all ops on resource MyApp_fs_graph by 1000 ms
May 11 12:34:57 host002 lrmd: [2561]: info: perform_op:2884: operation stop[202] on ocf::Filesystem::MyApp_fs_graph for client 31850, its parameters: fstype=[ext4] crm_feature_set=[3.0.2] device=[/dev/VolGroupB00/abb_graph] CRM_meta_timeout=[20000] directory=[/naab1] for rsc is already running.
May 11 12:34:57 host002 lrmd: [2561]: info: perform_op:2894: postponing all ops on resource MyApp_fs_graph by 1000 ms
May 11 12:34:58 host002 lrmd: [2561]: info: perform_op:2884: operation stop[202] on ocf::Filesystem::MyApp_fs_graph for client 31850, its parameters: fstype=[ext4] crm_feature_set=[3.0.2] device=[/dev/VolGroupB00/abb_graph] CRM_meta_timeout=[20000] directory=[/naab1] for rsc is already running.
May 11 12:34:58 host002 lrmd: [2561]: info: perform_op:2894: postponing all ops on resource MyApp_fs_graph by 1000 ms
May 11 12:34:58 host002 lrmd: [2561]: WARN: MyApp_lvm_graph:monitor process (PID 1938) timed out (try 1). Killing with signal SIGTERM (15).
May 11 12:34:58 host002 lrmd: [2561]: WARN: MyApp_lvm_landing:monitor process (PID 1939) timed out (try 1). Killing with signal SIGTERM (15).
May 11 12:34:58 host002 lrmd: [2561]: WARN: operation monitor[190] on ocf::LVM::MyApp_lvm_graph for client 31850, its parameters: CRM_meta_depth=[0] depth=[0] exclusive=[yes] crm_feature_set=[3.0.2] volgrpname=[VolGroupB00] CRM_meta_on_fail=[standby] CRM_meta_name=[monitor] CRM_meta_interval=[10000] CRM_meta_timeout=[10000] : pid [1938] timed out
May 11 12:34:58 host002 lrmd: [2561]: WARN: operation monitor[191] on ocf::LVM::MyApp_lvm_landing for client 31850, its parameters: CRM_meta_depth=[0] depth=[0] exclusive=[yes] crm_feature_set=[3.0.2] volgrpname=[VolGroupB01] CRM_meta_on_fail=[standby] CRM_meta_name=[monitor] CRM_meta_interval=[10000] CRM_meta_timeout=[10000] : pid [1939] timed out
May 11 12:34:58 host002 crmd: [31850]: ERROR: process_lrm_event: LRM operation MyApp_lvm_graph_monitor_10000 (190) Timed Out (timeout=10000ms)
May 11 12:34:58 host002 crmd: [31850]: ERROR: process_lrm_event: LRM operation MyApp_lvm_landing_monitor_10000 (191) Timed Out (timeout=10000ms)
May 11 12:34:59 host002 lrmd: [2561]: info: perform_op:2884: operation stop[202] on ocf::Filesystem::MyApp_fs_graph for client 31850, its parameters: fstype=[ext4] crm_feature_set=[3.0.2] device=[/dev/VolGroupB00/abb_graph] CRM_meta_timeout=[20000] directory=[/naab1] for rsc is already running.
May 11 12:34:59 host002 lrmd: [2561]: info: perform_op:2894: postponing all ops on resource MyApp_fs_graph by 1000 ms
I also noticed that there are about 100 vgdisplay processes stuck in D state (uninterruptible sleep).
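For what it's worth, I'm spotting them with a quick ad-hoc check along these lines (plain procps, nothing cluster-specific):

    # List vgdisplay processes in uninterruptible sleep (STAT begins with D).
    ps axo pid,stat,comm | awk '$2 ~ /^D/ && $3 == "vgdisplay"'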
How can I configure Pacemaker so that the other host forces sg_persist to become master and then simply takes over the whole resource group, without fencing?
I've tried "on-fail=standby" and "migration-threshold=0", but the cluster always gets stuck in the state shown above. If I reconnect the LUN, everything resumes and the failover happens instantly, but that is less than ideal.
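For reference, the relevant parts of my configuration look roughly like this (crm shell syntax, trimmed down and partly reconstructed from memory; the constraint IDs are just illustrative, and I've left out the sg_persist primitive itself):

    # One of the LVM resources, with the monitor op carrying the
    # on-fail=standby setting I mentioned above.
    primitive MyApp_lvm_graph ocf:heartbeat:LVM \
        params volgrpname="VolGroupB00" exclusive="yes" \
        op monitor interval="10s" timeout="10s" on-fail="standby"

    # The group should run only where the reservation is master, and only
    # after promotion, so a storage failure ought to drag everything across.
    colocation MyApp_with_reservation inf: MyApp_group ms_MyApp_scsi_reservation:Master
    order reservation_before_MyApp inf: ms_MyApp_scsi_reservation:promote MyApp_group:start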
Thanks,
Max