[ClusterLabs] Pacemaker on-fail standby recovery does not start DRBD slave resource
Ken Gaillot
kgaillot at redhat.com
Wed Mar 30 16:46:21 UTC 2016
On 03/30/2016 11:20 AM, Sam Gardner wrote:
> I have configured some network resources to put their node into standby automatically if the system detects a failure on them. However, the DRBD slave resource I have configured does not automatically restart once the node comes out of standby when the failure-timeout expires.
> Is there any way to make the "stopped" DRBDSlave resource automatically start again once the node is recovered?
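>
> (For reference, the relevant pieces of the configuration look roughly
> like the following, in crm shell syntax; the values are illustrative
> rather than copied verbatim from the live CIB:)
>
>   primitive dmz1 ocf:custom:ip.sh \
>       op monitor interval=7s on-fail=standby \
>       meta failure-timeout=120s
>
>   ms DRBDMaster DRBDSlave \
>       meta master-max=1 clone-max=2 notify=true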
>
> See the progression of events below:
>
> Running cluster:
> Wed Mar 30 16:04:20 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:04:20 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
>
> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>
> Full list of resources:
>
> Resource Group: network
> inif (ocf::custom:ip.sh): Started ha-d1.tw.com
> outif (ocf::custom:ip.sh): Started ha-d1.tw.com
> dmz1 (ocf::custom:ip.sh): Started ha-d1.tw.com
> Master/Slave Set: DRBDMaster [DRBDSlave]
> Masters: [ ha-d1.tw.com ]
> Slaves: [ ha-d2.tw.com ]
> Resource Group: filesystem
> DRBDFS (ocf::heartbeat:Filesystem): Started ha-d1.tw.com
> Resource Group: application
> service_failover (ocf::custom:service_failover): Started ha-d1.tw.com
>
>
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
>
> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
> ns:4 nr:0 dw:4 dr:757 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> [153766.565352] block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
> [153766.568303] block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
> [153766.568316] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
> [153766.568356] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 255 (0xfffffffe)
> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
> [153766.568444] block drbd1: updated sync UUID B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> [153766.577700] block drbd1: updated UUIDs B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
>
> Failure detected:
> Wed Mar 30 16:08:22 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:08:22 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
>
> Node ha-d1.tw.com: standby (on-fail)
> Online: [ ha-d2.tw.com ]
>
> Full list of resources:
>
> Resource Group: network
> inif (ocf::custom:ip.sh): Started ha-d1.tw.com
> outif (ocf::custom:ip.sh): Started ha-d1.tw.com
> dmz1 (ocf::custom:ip.sh): FAILED ha-d1.tw.com
> Master/Slave Set: DRBDMaster [DRBDSlave]
> Masters: [ ha-d1.tw.com ]
> Slaves: [ ha-d2.tw.com ]
> Resource Group: filesystem
> DRBDFS (ocf::heartbeat:Filesystem): Started ha-d1.tw.com
> Resource Group: application
> service_failover (ocf::custom:service_failover): Started ha-d1.tw.com
>
> Failed actions:
> dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, exec=0ms
>
>
>
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
>
> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
> ns:4 nr:0 dw:4 dr:765 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> [153766.568356] block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 255 (0xfffffffe)
> [153766.568363] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> [153766.568374] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
> [153766.568444] block drbd1: updated sync UUID B0DA745C79C56591:36E0631B6F022952:36DF631B6F022952:133127197CF097C6
> [153766.577695] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
> [153766.577700] block drbd1: updated UUIDs B0DA745C79C56591:0000000000000000:36E0631B6F022952:36DF631B6F022952
> [153766.577705] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> [154057.455270] e1000: eth2 NIC Link is Down
> [154057.455451] e1000 0000:02:02.0 eth2: Reset adapter
>
> Failover complete:
> Wed Mar 30 16:09:02 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:09:02 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
>
> Node ha-d1.tw.com: standby (on-fail)
> Online: [ ha-d2.tw.com ]
>
> Full list of resources:
>
> Resource Group: network
> inif (ocf::custom:ip.sh): Started ha-d2.tw.com
> outif (ocf::custom:ip.sh): Started ha-d2.tw.com
> dmz1 (ocf::custom:ip.sh): Started ha-d2.tw.com
> Master/Slave Set: DRBDMaster [DRBDSlave]
> Masters: [ ha-d2.tw.com ]
> Stopped: [ ha-d1.tw.com ]
> Resource Group: filesystem
> DRBDFS (ocf::heartbeat:Filesystem): Started ha-d2.tw.com
> Resource Group: application
> service_failover (ocf::custom:service_failover): Started ha-d2.tw.com
>
> Failed actions:
> dmz1_monitor_7000 on ha-d1.tw.com 'not running' (7): call=156, status=complete, last-rc-change='Wed Mar 30 16:08:19 2016', queued=0ms, exec=0ms
>
>
>
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> [154094.894524] drbd wwwdata: conn( Disconnecting -> StandAlone )
> [154094.894525] drbd wwwdata: receiver terminated
> [154094.894527] drbd wwwdata: Terminating drbd_r_wwwdata
> [154094.894559] block drbd1: disk( UpToDate -> Failed )
> [154094.894569] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
> [154094.894571] block drbd1: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
> [154094.894574] block drbd1: disk( Failed -> Diskless )
> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
>
> Standby node recovered, with DRBDSlave stopped (I want DRBDSlave started here):
> Wed Mar 30 16:13:01 UTC 2016
> Cluster name:
> Last updated: Wed Mar 30 16:13:01 2016
> Last change: Wed Mar 30 16:03:24 2016
> Stack: classic openais (with plugin)
> Current DC: ha-d1.tw.com - partition with quorum
> Version: 1.1.12-561c4cf
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
>
> Online: [ ha-d1.tw.com ha-d2.tw.com ]
>
> Full list of resources:
>
> Resource Group: network
> inif (ocf::custom:ip.sh): Started ha-d2.tw.com
> outif (ocf::custom:ip.sh): Started ha-d2.tw.com
> dmz1 (ocf::custom:ip.sh): Started ha-d2.tw.com
> Master/Slave Set: DRBDMaster [DRBDSlave]
> Masters: [ ha-d2.tw.com ]
> Stopped: [ ha-d1.tw.com ]
> Resource Group: filesystem
> DRBDFS (ocf::heartbeat:Filesystem): Started ha-d2.tw.com
> Resource Group: application
> service_failover (ocf::custom:service_failover): Started ha-d2.tw.com
>
>
> version: 8.4.5 (api:1/proto:86-101)
> srcversion: 315FB2BBD4B521D13C20BF4
> [154094.894574] block drbd1: disk( Failed -> Diskless )
> [154094.894647] block drbd1: drbd_bm_resize called with capacity == 0
> [154094.894652] drbd wwwdata: Terminating drbd_w_wwwdata
>
> --
> Sam Gardner
> Trustwave | SMART SECURITY ON DEMAND
This might be a bug. A crm_report covering a few minutes around when the
failure-timeout expires might help.
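Something like this should capture the relevant window (the timestamps
are only an example, bracketing the monitor failure and the
failure-timeout expiry):

  crm_report -f "2016-03-30 16:05:00" -t "2016-03-30 16:20:00" drbd-standby-report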
Does the slave start after the next cluster-recheck-interval?
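If so, one interim workaround would be to shorten that interval, e.g.
(the 2min value is only an example; the default is 15 minutes):

  crm_attribute --type crm_config --name cluster-recheck-interval --update 2min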