[Pacemaker] [Question and Problem] In a vSphere 5.1 environment, pengine IO blocks for a long time when the shared disk fails.
renayama19661014 at ybb.ne.jp
Mon May 13 06:14:24 UTC 2013
Hi All,
We built a simple cluster in a vSphere 5.1 environment.
It consists of two ESXi servers and a shared disk.
The guest VMs are placed on the shared disk.
Step 1) Build the cluster. (The DC node is the active node.)
============
Last updated: Mon May 13 14:16:09 2013
Stack: Heartbeat
Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
Version: 1.0.13-30bb726
2 Nodes configured, unknown expected votes
2 Resources configured.
============
Online: [ pgsr01 pgsr02 ]
Resource Group: test-group
    Dummy1      (ocf::pacemaker:Dummy): Started pgsr01
    Dummy2      (ocf::pacemaker:Dummy): Started pgsr01
Clone Set: clnPingd
    Started: [ pgsr01 pgsr02 ]
Node Attributes:
* Node pgsr01:
+ default_ping_set : 100
* Node pgsr02:
+ default_ping_set : 100
Migration summary:
* Node pgsr01:
* Node pgsr02:
Step 2) Attach strace to the pengine process on the DC node.
[root@pgsr01 ~]# ps -ef |grep heartbeat
root 2072 1 0 13:56 ? 00:00:00 heartbeat: master control process
root 2075 2072 0 13:56 ? 00:00:00 heartbeat: FIFO reader
root 2076 2072 0 13:56 ? 00:00:00 heartbeat: write: bcast eth1
root 2077 2072 0 13:56 ? 00:00:00 heartbeat: read: bcast eth1
root 2078 2072 0 13:56 ? 00:00:00 heartbeat: write: bcast eth2
root 2079 2072 0 13:56 ? 00:00:00 heartbeat: read: bcast eth2
496 2082 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/ccm
496 2083 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/cib
root 2084 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/lrmd -r
root 2085 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/stonithd
496 2086 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/attrd
496 2087 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/crmd
496 2089 2087 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/pengine
root 2182 1 0 14:15 ? 00:00:00 /usr/lib64/heartbeat/pingd -D -p /var/run//pingd-default_ping_set -a default_ping_set -d 5s -m 100 -i 1 -h 192.168.101.254
root 2287 1973 0 14:16 pts/0 00:00:00 grep heartbea
[root@pgsr01 ~]# strace -p 2089
Process 2089 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
times({tms_utime=5, tms_stime=6, tms_cutime=0, tms_cstime=0}) = 429527557
recvfrom(5, 0xa93ff7, 953, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
recvfrom(5, 0xa93ff7, 953, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
(snip)
Step 3) Disconnect the shared disk on which the active node's guest is placed.
Step 4) Cut off the pingd network of the standby node.
The pingd score is reflected correctly, but pengine's processing is blocked.
~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1
(snip)
brk(0xd05000) = 0xd05000
brk(0xeed000) = 0xeed000
brk(0xf2d000) = 0xf2d000
fstat(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86a255a000
write(6, "BZh51AY&SY\327\373\370\203\0\t(_\200UPX\3\377\377%cT \277\377\377"..., 2243) = 2243
brk(0xb1d000) = 0xb1d000
fsync(6 ------------------------------> BLOCKED
(snip)
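The "BZh" magic in the write() above suggests that fd 6 is the bzip2-compressed pe-input file which pengine saves for each transition, and that the fsync() on it hangs because the guest's filesystem sits on the disconnected shared datastore. If that assumption is right, the hang should be reproducible outside Pacemaker with a tiny program that writes and fsyncs a file on the same filesystem. A minimal sketch (not Pacemaker code; the path is only illustrative):

/* Write and fsync a small file on the affected filesystem.  If the
 * underlying shared disk is blocked, fsync() is expected to hang in
 * the same way pengine's fsync(6) does above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Illustrative path; point it at a filesystem backed by the shared disk. */
    const char *path = "/var/lib/pengine/fsync-test.tmp";
    const char buf[] = "fsync test\n";

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, buf, strlen(buf)) < 0) {
        perror("write");
        return 1;
    }
    fprintf(stderr, "calling fsync()...\n");
    if (fsync(fd) < 0) {        /* expected to block here while the disk is gone */
        perror("fsync");
        return 1;
    }
    fprintf(stderr, "fsync() returned\n");
    close(fd);
    return 0;
}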
============
Last updated: Mon May 13 14:19:15 2013
Stack: Heartbeat
Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
Version: 1.0.13-30bb726
2 Nodes configured, unknown expected votes
2 Resources configured.
============
Online: [ pgsr01 pgsr02 ]
Resource Group: test-group
    Dummy1      (ocf::pacemaker:Dummy): Started pgsr01
    Dummy2      (ocf::pacemaker:Dummy): Started pgsr01
Clone Set: clnPingd
    Started: [ pgsr01 pgsr02 ]
Node Attributes:
* Node pgsr01:
+ default_ping_set : 100
* Node pgsr02:
+ default_ping_set : 0 : Connectivity is lost
Migration summary:
* Node pgsr01:
* Node pgsr02:
Step 5) Reconnect the pingd network of the standby node.
The pingd score is reflected correctly, but pengine remains blocked.
~ # esxcfg-vswitch -M vmnic1 -p "ap-db" vSwitch1
~ # esxcfg-vswitch -M vmnic2 -p "ap-db" vSwitch1
============
Last updated: Mon May 13 14:19:40 2013
Stack: Heartbeat
Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
Version: 1.0.13-30bb726
2 Nodes configured, unknown expected votes
2 Resources configured.
============
Online: [ pgsr01 pgsr02 ]
Resource Group: test-group
    Dummy1      (ocf::pacemaker:Dummy): Started pgsr01
    Dummy2      (ocf::pacemaker:Dummy): Started pgsr01
Clone Set: clnPingd
    Started: [ pgsr01 pgsr02 ]
Node Attributes:
* Node pgsr01:
+ default_ping_set : 100
* Node pgsr02:
+ default_ping_set : 100
Migration summary:
* Node pgsr01:
* Node pgsr02:
--------- The blocked state of pengine continues -----
Step 6) Cut off the pingd network of the active node.
The pingd score is reflected correctly, but pengine remains blocked.
~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1
============
Last updated: Mon May 13 14:20:32 2013
Stack: Heartbeat
Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
Version: 1.0.13-30bb726
2 Nodes configured, unknown expected votes
2 Resources configured.
============
Online: [ pgsr01 pgsr02 ]
Resource Group: test-group
    Dummy1      (ocf::pacemaker:Dummy): Started pgsr01
    Dummy2      (ocf::pacemaker:Dummy): Started pgsr01
Clone Set: clnPingd
    Started: [ pgsr01 pgsr02 ]
Node Attributes:
* Node pgsr01:
+ default_ping_set : 0 : Connectivity is lost
* Node pgsr02:
+ default_ping_set : 100
Migration summary:
* Node pgsr01:
* Node pgsr02:
--------- The blocked state of pengine continues -----
After that, the resources do not move to the standby node, because pengine remains blocked and no transition is produced.
In the vSphere environment the block is only released after a considerable time has passed, and only then is a transition generated.
* The IO blocking of pengine seems to occur repeatedly.
* Other processes may be blocked as well.
* It took more than one hour from the failure until failover completed.
This problem shows that resources may fail to move after disk trouble in a vSphere environment.
Because our users want to run Pacemaker in vSphere environments, a solution to this problem is necessary.
Do you know of any case where a similar problem was solved on vSphere?
If there is no known solution, we think the blocking of pengine needs to be avoided somehow.
For example...
1. crmd could watch its requests to pengine with a timer...
2. pengine could watch its own writes and processing with a timer...
..etc... (a rough sketch of idea 2 is shown below)
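As a very rough illustration of idea 2 (not actual Pacemaker code; the timeout, path, and function name are assumptions), the potentially blocking write and fsync could be done in a child process that the parent watches with a timer, so the daemon can at least log the stall and react instead of hanging silently:

/* Minimal sketch: perform the write+fsync in a child and watch it with
 * a timer in the parent. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define WRITE_TIMEOUT_SEC 30    /* assumed timeout, for illustration only */

static int write_with_watchdog(const char *path, const char *buf, size_t len)
{
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: the write + fsync that may block on a hung datastore. */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0 || write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            _exit(1);
        }
        close(fd);
        _exit(0);
    }

    /* Parent: poll the child with a timer instead of blocking forever. */
    for (int waited = 0; waited < WRITE_TIMEOUT_SEC; waited++) {
        int status = 0;
        if (waitpid(pid, &status, WNOHANG) == pid) {
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
        }
        sleep(1);
    }
    fprintf(stderr, "write of %s did not finish within %d seconds; "
            "the disk may be blocked\n", path, WRITE_TIMEOUT_SEC);
    return -1;  /* the caller can decide to retry, skip the write, or escalate */
}

int main(void)
{
    const char *data = "example transition data\n";
    /* Illustrative path; pengine normally writes pe-input files under /var/lib/pengine. */
    return write_with_watchdog("/tmp/pe-input-example.bz2", data, strlen(data)) ? 1 : 0;
}

Of course a real fix would have to live inside crmd/pengine; this only shows the timer idea itself.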
* This problem does not seem to occur on KVM.
* The difference may come from the hypervisor.
* In addition, the problem did not occur on a physical Linux machine.
Best Regards,
Hideo Yamauchi.