[Pacemaker] Problems shutting down server/services
Erich Weiler
weiler at soe.ucsc.edu
Wed Jun 16 18:09:13 UTC 2010
Hi All,
I'm having a weird issue that I was hoping someone could shed some light
on...
I've got Pacemaker 1.0.8 and Corosync 1.2.2 on a two-node cluster. I'm
managing some VMs via VirtualDomain, with DRBD underneath. It mostly
works, but I'm having a really hard time shutting down services, or even
shutting down one of the nodes. As I understand it, if I issue the
shutdown command on one node, Pacemaker should try to stop the services
on that node and move them to the other node (if so configured), then
power down.
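(For what it's worth, the manual equivalent of what I expect a reboot to
do would be something like this; just a sketch using the crm shell that
ships with Pacemaker 1.0.x:
# crm node standby hgvmcore1
# crm_mon -1          <- wait until everything is running on hgvmcore2
# shutdown -r now
My understanding, though, is that a plain reboot should handle the
stop/migrate part on its own.)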
Normally crm_mon shows this:
============
Last updated: Wed Jun 16 10:23:14 2010
Stack: openais
Current DC: hgvmcore2 - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ hgvmcore1 hgvmcore2 ]
Master/Slave Set: ms-drbd-caweb
Masters: [ hgvmcore1 ]
Slaves: [ hgvmcore2 ]
caweb-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore1
Master/Slave Set: ms-drbd-genome-nt
Masters: [ hgvmcore2 ]
Slaves: [ hgvmcore1 ]
genome-nt-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore2
Looks good. But when I try to reboot hgvmcore1, it does not go down, and
I see this in the logs:
Jun 16 10:30:16 hgvmcore1 shutdown[20339]: shutting down for system reboot
Jun 16 10:30:16 hgvmcore1 init: Switching to runlevel: 6
Jun 16 10:30:17 hgvmcore1 crm_attribute: [20398]: info: Invoked:
crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -v 100
Jun 16 10:30:17 hgvmcore1 smartd[16139]: smartd received signal 15:
Terminated
Jun 16 10:30:17 hgvmcore1 smartd[16139]: smartd is exiting (exit status 0)
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Got SIGTERM, quitting.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface vnet1.IPv6 with address fe80::7817:6eff:fe1c:506f.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface vnet0.IPv6 with address fe80::d465:2ff:fefa:73ff.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface virbr0.IPv6 with address fe80::200:ff:fe00:0.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface virbr0.IPv4 with address 192.168.122.1.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface br3.IPv6 with address fe80::225:64ff:fefc:1e3b.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface br3.IPv4 with address 128.114.48.135.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface br2.IPv6 with address fe80::225:64ff:fefc:1e39.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface br2.IPv4 with address 10.1.1.89.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface eth3.IPv6 with address fe80::225:64ff:fefc:1e3b.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface eth2.IPv6 with address fe80::225:64ff:fefc:1e39.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface eth1.IPv6 with address fe80::210:18ff:fe67:db0a.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface eth1.IPv4 with address 172.16.0.1.
Jun 16 10:30:18 hgvmcore1 libvirtd: 10:30:18.192: warning : Shutting
down on signal 15
Jun 16 10:30:18 hgvmcore1 corosync[15541]: [SERV ] Unloading all
Corosync service engines.
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: crm_signal_dispatch:
Invoking handler for signal 15: Terminated
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: crm_shutdown: Requesting
shutdown
Jun 16 10:30:18 hgvmcore1 corosync[15541]: [pcmk ] notice:
pcmk_shutdown: Shuting down Pacemaker
Jun 16 10:30:18 hgvmcore1 corosync[15541]: [pcmk ] notice:
stop_child: Sent -15 to crmd: [15563]
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_shutdown_req: Sending
shutdown request to DC: hgvmcore2
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
Update relayed from hgvmcore2
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: shutdown (1276709418)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent update 48: shutdown=1276709418
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: cancel_op: operation
monitor[21] on ocf::VirtualDomain::caweb-vd for client 15563, its
parameters: CRM_meta_depth=[0] CRM_meta_interval=[10000] depth=[0]
hypervisor=[qemu:///system] CRM_meta_timeout=[30000]
crm_feature_set=[3.0.1] CRM_meta_name=[monitor]
config=[/etc/libvirt/qemu/caweb.xml] cancelled
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=39:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29 op=caweb-vd_stop_0 )
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:caweb-vd:22: stop
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=88:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29 op=drbd-caweb:0_notify_0 )
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-caweb:0:23: notify
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=96:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29
op=drbd-genome-nt:1_notify_0 )
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-genome-nt:1:24:
notify
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation caweb-vd_monitor_10000 (call=21, status=1, cib-update=0,
confirmed=true) Cancelled
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) error: unable to connect to
'/var/run/libvirt/libvirt-sock': Connection refused error: failed to
connect to the hypervisor
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) error:
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) unable to connect to
'/var/run/libvirt/libvirt-sock': Connection refused error: failed to
connect to the hypervisor
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-caweb:0_notify_0 (call=23, rc=0, cib-update=32,
confirmed=true) ok
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-genome-nt:1_notify_0 (call=24, rc=0, cib-update=33,
confirmed=true) ok
Jun 16 10:30:18 hgvmcore1 VirtualDomain[20485]: ERROR: Virtual domain
caweb has no state during stop operation, bailing out.
Jun 16 10:30:18 hgvmcore1 VirtualDomain[20485]: INFO: Issuing forced
shutdown (destroy) request for domain caweb.
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) error: unable to connect to
'/var/run/libvirt/libvirt-sock': Connection refused error:
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) failed to connect to the hypervisor
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation caweb-vd_stop_0 (call=22, rc=1, cib-update=34, confirmed=true)
unknown error
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: cancel_op: operation
monitor[11] on ocf::drbd::drbd-genome-nt:1 for client 15563, its
parameters: CRM_meta_interval=[15000] CRM_meta_notify_stop_resource=[ ]
CRM_meta_notify_active_resource=[ ] CRM_meta_notify_slave_resource=[ ]
CRM_meta_notify_start_resource=[drbd-genome-nt:1 ]
CRM_meta_notify_active_uname=[ ] CRM_meta_notify_promote_resource=[ ]
CRM_meta_notify_stop_uname=[ ] drbd_resource=[genome-nt]
CRM_meta_notify_master_uname=[hgvmcore2 ] CRM_meta_master_max=[1]
CRM_meta_notify_demote_uname=[ ] CRM_meta_notify_ cancelled
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=46:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29
op=drbd-genome-nt:1_stop_0 )
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-genome-nt:1:25: stop
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-genome-nt:1_monitor_15000 (call=11, status=1,
cib-update=0, confirmed=true) Cancelled
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
Update relayed from hgvmcore2
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-caweb-vd (INFINITY)
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: peer( Primary -> Unknown
) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: meta connection shut down
by peer.
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: asender terminated
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating asender thread
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Connection closed
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: conn( Disconnecting ->
StandAlone )
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: receiver terminated
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating receiver thread
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: disk( UpToDate -> Diskless )
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: drbd_bm_resize called
with capacity == 0
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: worker terminated
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating worker thread
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(drbd-genome-nt:1:stop:stdout)
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: State change failed: Disk
state is lower than outdated
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: state = { cs:StandAlone
ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: wanted = { cs:StandAlone
ro:Secondary/Unknown ds:Outdated/DUnknown r--- }
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent update 51: fail-count-caweb-vd=INFINITY
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
Update relayed from hgvmcore2
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-caweb-vd (1276709418)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent update 54: last-failure-caweb-vd=1276709418
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(drbd-genome-nt:1:stop:stdout)
Jun 16 10:30:18 hgvmcore1 crm_attribute: [20593]: info: Invoked:
crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -v -INFINITY
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd-genome-nt:1 (-INFINITY)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent update 56: master-drbd-genome-nt:1=-INFINITY
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(drbd-genome-nt:1:stop:stdout)
Jun 16 10:30:18 hgvmcore1 crm_attribute: [20597]: info: Invoked:
crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -D
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd-genome-nt:1 (<null>)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent delete 58: node=hgvmcore1, attr=master-drbd-genome-nt:1, id=<n/a>,
set=(null), section=status
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(drbd-genome-nt:1:stop:stdout)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent delete 60: node=hgvmcore1, attr=master-drbd-genome-nt:1, id=<n/a>,
set=(null), section=status
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-genome-nt:1_stop_0 (call=25, rc=0, cib-update=35,
confirmed=true) ok
Jun 16 10:30:19 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=85:449:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29 op=drbd-caweb:0_notify_0 )
Jun 16 10:30:19 hgvmcore1 lrmd: [15560]: info: rsc:drbd-caweb:0:26: notify
Jun 16 10:30:19 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-caweb:0_notify_0 (call=26, rc=0, cib-update=36,
confirmed=true) ok
Jun 16 10:30:48 hgvmcore1 corosync[15541]: [pcmk ] notice:
pcmk_shutdown: Still waiting for crmd (pid=15563, seq=6) to terminate...
and crm_mon shows this:
============
Last updated: Wed Jun 16 10:30:18 2010
Stack: openais
Current DC: hgvmcore2 - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ hgvmcore1 hgvmcore2 ]
Master/Slave Set: ms-drbd-caweb
Masters: [ hgvmcore1 ]
Slaves: [ hgvmcore2 ]
caweb-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore1
(unmanaged) FAILED
Master/Slave Set: ms-drbd-genome-nt
Masters: [ hgvmcore2 ]
Stopped: [ drbd-genome-nt:1 ]
genome-nt-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore2
Failed actions:
caweb-vd_stop_0 (node=hgvmcore1, call=22, rc=1, status=complete):
unknown error
So it seems to have trouble shutting down. After 15-20 minutes the
machine does eventually shut down/reboot; the VMs themselves shut down
quickly, though I'm not sure about DRBD. I don't know if it will help,
but here is my crm config:
node hgvmcore1
node hgvmcore2
primitive caweb-vd ocf:heartbeat:VirtualDomain \
params config="/etc/libvirt/qemu/caweb.xml"
hypervisor="qemu:///system" \
meta allow-migrate="false" is-managed="true"
target-role="Started" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
op monitor interval="10" timeout="30" depth="0"
primitive drbd-caweb ocf:linbit:drbd \
params drbd_resource="caweb" \
op monitor interval="15s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive drbd-genome-nt ocf:linbit:drbd \
params drbd_resource="genome-nt" \
op monitor interval="15s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive genome-nt-vd ocf:heartbeat:VirtualDomain \
params config="/etc/libvirt/qemu/genome-nt.xml"
hypervisor="qemu:///system" \
meta allow-migrate="false" is-managed="true"
target-role="Started" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
op monitor interval="10" timeout="30" depth="0"
ms ms-drbd-caweb drbd-caweb \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started" is-managed="true"
ms ms-drbd-genome-nt drbd-genome-nt \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
location caweb-prefers-hgvmcore1 caweb-vd 50: hgvmcore1
location genome-nt-prefers-hgvmcore2 genome-nt-vd 50: hgvmcore2
colocation caweb-vd-on-drbd inf: caweb-vd ms-drbd-caweb:Master
colocation genome-nt-vd-on-drbd inf: genome-nt-vd ms-drbd-genome-nt:Master
order caweb-after-drbd inf: ms-drbd-caweb:promote caweb-vd:start
order genome-nt-after-drbd inf: ms-drbd-genome-nt:promote genome-nt-vd:start
property $id="cib-bootstrap-options" \
dc-version="1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1276708976"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
And here is the corosync config:
# cat /etc/corosync/corosync.conf
# Please read the corosync.conf.5 manual page
compatibility: whitetank
aisexec {
user: root
group: root
}
totem {
version: 2
# How long before declaring a token lost (ms)
token: 5000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 20
# How long to wait for join messages in the membership protocol (ms)
join: 1000
# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
consensus: 7500
# Turn off the virtual synchrony filter
vsftype: none
# Number of messages that may be sent by one processor on receipt of the token
max_messages: 20
# Disable encryption
secauth: off
# How many threads to use for encryption/decryption
threads: 0
# Limit generated nodeids to 31-bits (positive signed integers)
clear_node_high_bit: yes
# Optionally assign a fixed node id (integer)
# nodeid: 1234
interface {
ringnumber: 0
bindnetaddr: 10.1.0.255
mcastaddr: 226.94.1.55
mcastport: 4001
}
}
logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /var/log/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
amf {
mode: disabled
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 0
}
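If it would help to gather more info while the shutdown hangs, these are
the kinds of checks I could run from another shell (a sketch, not
output):
# ps -ef | egrep 'corosync|crmd|lrmd|pengine|cib|attrd'
# crm_mon -1 -f          <- -f also shows the fail counts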
Is this normal, or maybe a known issue? Any help much appreciated!!!
BTW - if I try to shut down a VM via:
# crm resource stop my_vm_resource
crm_mon shows the same thing, i.e. "caweb-vd
(ocf::heartbeat:VirtualDomain): Started hgvmcore1 (unmanaged) FAILED",
and the resource never stops cleanly as far as Pacemaker is concerned,
even though the VM itself shuts down cleanly. I never put it in
'unmanaged' mode; it seems to go into that state by itself.
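I assume the failed stop then has to be cleaned up with something like
this before Pacemaker will manage the resource again (just a sketch; I
haven't confirmed it's the right fix):
# crm resource cleanup caweb-vd
That should clear the fail-count and the failed stop record, but
obviously I'd rather it not fail in the first place.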
Cheers,
-erich