[Pacemaker] [Question and Problem] In a vSphere 5.1 environment, pengine blocks on I/O for a long time when the shared disk fails.
renayama19661014 at ybb.ne.jp
Wed May 15 05:31:29 UTC 2013
Hi Andrew,
> > Thank you for your comments.
> >
> >>> The guests are placed on the shared disk.
> >>
> >> What is on the shared disk? The whole OS or app-specific data (i.e. nothing pacemaker needs directly)?
> >
> > The shared disk holds the whole OS and all the data.
>
> Oh. I can imagine that being problematic.
> Pacemaker really isn't designed to function without disk access.
I think so, too.
That is why I made the following suggestion:
> >>> For example...
> >>> 1. crmd monitors its request to pengine with a timer...
> >>> 2. pengine performs its writes under a timer and monitors their progress....
> >>> ..etc...
But there may be a better method.
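As a rough illustration of the suggestion above, the write could run in a worker thread that is watched with a timer, so a hung fsync() does not block the caller indefinitely. This is only a hypothetical sketch; the names (write_pe_file, TIMEOUT_S) are made up and this is not how pengine is actually implemented:

```python
import os
import threading

TIMEOUT_S = 5.0  # illustrative timeout, not a real Pacemaker setting

def write_pe_file(path, data):
    # Plain blocking write + fsync; fsync() is the call that blocked
    # in the strace output below.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

def write_with_timeout(path, data, timeout=TIMEOUT_S):
    # Run the write in a worker thread and wait on it with a timer.
    t = threading.Thread(target=write_pe_file, args=(path, data), daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # Still blocked: give up on saving this file so the caller
        # can continue; the worker is left to finish on its own.
        return False
    return True
```

On a healthy disk the write finishes quickly and the function returns True; when the disk has failed, the caller gets False after the timeout instead of hanging.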
>
> You might be able to get away with it if you turn off saving PE files to disk though.
>
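For reference, turning off the saving of PE series files might look like the following. This is only a sketch: please verify these cluster options and their exact semantics against the documentation for your Pacemaker version before relying on them.

```shell
# Sketch: stop keeping Policy Engine series files on disk.
# Option names exist in Pacemaker's cluster options; whether 0
# fully disables saving should be checked for your version.
crm_attribute --type crm_config --name pe-input-series-max --update 0
crm_attribute --type crm_config --name pe-error-series-max --update 0
crm_attribute --type crm_config --name pe-warn-series-max --update 0
```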
> > The placement of this shared disk is similar in KVM where the problem does not occur.
>
> That it works in KVM in this situation is kind of surprising.
> Or perhaps I misunderstand.
I will check the details of the behavior on KVM once again.
However, the behavior on KVM is clearly different from the behavior on vSphere 5.1.
Best Regards,
Hideo Yamauchi.
>
> >
> > * We understand that the behavior differs depending on the hypervisor.
> > * However, it seems necessary to work around this problem in order to use Pacemaker in a vSphere 5.1 environment.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> > --- On Wed, 2013/5/15, Andrew Beekhof <andrew at beekhof.net> wrote:
> >
> >>
> >> On 13/05/2013, at 4:14 PM, renayama19661014 at ybb.ne.jp wrote:
> >>
> >>> Hi All,
> >>>
> >>> We built a simple cluster in a vSphere 5.1 environment.
> >>>
> >>> It consists of two ESXi servers and a shared disk.
> >>>
> >>> The guests are placed on the shared disk.
> >>
> >> What is on the shared disk? The whole OS or app-specific data (i.e. nothing pacemaker needs directly)?
> >>
> >>>
> >>>
> >>> Step 1) Build the cluster. (The DC node is the active node.)
> >>>
> >>> ============
> >>> Last updated: Mon May 13 14:16:09 2013
> >>> Stack: Heartbeat
> >>> Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> >>> Version: 1.0.13-30bb726
> >>> 2 Nodes configured, unknown expected votes
> >>> 2 Resources configured.
> >>> ============
> >>>
> >>> Online: [ pgsr01 pgsr02 ]
> >>>
> >>> Resource Group: test-group
> >>> Dummy1 (ocf::pacemaker:Dummy): Started pgsr01
> >>> Dummy2 (ocf::pacemaker:Dummy): Started pgsr01
> >>> Clone Set: clnPingd
> >>> Started: [ pgsr01 pgsr02 ]
> >>>
> >>> Node Attributes:
> >>> * Node pgsr01:
> >>> + default_ping_set : 100
> >>> * Node pgsr02:
> >>> + default_ping_set : 100
> >>>
> >>> Migration summary:
> >>> * Node pgsr01:
> >>> * Node pgsr02:
> >>>
> >>>
> >>> Step 2) Attach strace to the pengine process on the DC node.
> >>>
> >>> [root at pgsr01 ~]# ps -ef |grep heartbeat
> >>> root 2072 1 0 13:56 ? 00:00:00 heartbeat: master control process
> >>> root 2075 2072 0 13:56 ? 00:00:00 heartbeat: FIFO reader
> >>> root 2076 2072 0 13:56 ? 00:00:00 heartbeat: write: bcast eth1
> >>> root 2077 2072 0 13:56 ? 00:00:00 heartbeat: read: bcast eth1
> >>> root 2078 2072 0 13:56 ? 00:00:00 heartbeat: write: bcast eth2
> >>> root 2079 2072 0 13:56 ? 00:00:00 heartbeat: read: bcast eth2
> >>> 496 2082 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/ccm
> >>> 496 2083 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/cib
> >>> root 2084 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/lrmd -r
> >>> root 2085 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/stonithd
> >>> 496 2086 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/attrd
> >>> 496 2087 2072 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/crmd
> >>> 496 2089 2087 0 13:57 ? 00:00:00 /usr/lib64/heartbeat/pengine
> >>> root 2182 1 0 14:15 ? 00:00:00 /usr/lib64/heartbeat/pingd -D -p /var/run//pingd-default_ping_set -a default_ping_set -d 5s -m 100 -i 1 -h 192.168.101.254
> >>> root 2287 1973 0 14:16 pts/0 00:00:00 grep heartbea
> >>>
> >>> [root at pgsr01 ~]# strace -p 2089
> >>> Process 2089 attached - interrupt to quit
> >>> restart_syscall(<... resuming interrupted call ...>) = 0
> >>> times({tms_utime=5, tms_stime=6, tms_cutime=0, tms_cstime=0}) = 429527557
> >>> recvfrom(5, 0xa93ff7, 953, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> >>> poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
> >>> recvfrom(5, 0xa93ff7, 953, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> >>> poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
> >>> (snip)
> >>>
> >>>
> >>> Step 3) Disconnect the shared disk on which the active node is placed.
> >>>
> >>> Step 4) Cut off pingd communication on the standby node.
> >>> The pingd score is updated correctly, but pengine's processing is blocked.
> >>>
> >>> ~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
> >>> ~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1
> >>>
> >>>
> >>> (snip)
> >>> brk(0xd05000) = 0xd05000
> >>> brk(0xeed000) = 0xeed000
> >>> brk(0xf2d000) = 0xf2d000
> >>> fstat(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
> >>> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86a255a000
> >>> write(6, "BZh51AY&SY\327\373\370\203\0\t(_\200UPX\3\377\377%cT \277\377\377"..., 2243) = 2243
> >>> brk(0xb1d000) = 0xb1d000
> >>> fsync(6 ------------------------------> BLOCKED
> >>> (snip)
> >>>
> >>>
> >>> ============
> >>> Last updated: Mon May 13 14:19:15 2013
> >>> Stack: Heartbeat
> >>> Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> >>> Version: 1.0.13-30bb726
> >>> 2 Nodes configured, unknown expected votes
> >>> 2 Resources configured.
> >>> ============
> >>>
> >>> Online: [ pgsr01 pgsr02 ]
> >>>
> >>> Resource Group: test-group
> >>> Dummy1 (ocf::pacemaker:Dummy): Started pgsr01
> >>> Dummy2 (ocf::pacemaker:Dummy): Started pgsr01
> >>> Clone Set: clnPingd
> >>> Started: [ pgsr01 pgsr02 ]
> >>>
> >>> Node Attributes:
> >>> * Node pgsr01:
> >>> + default_ping_set : 100
> >>> * Node pgsr02:
> >>> + default_ping_set : 0 : Connectivity is lost
> >>>
> >>> Migration summary:
> >>> * Node pgsr01:
> >>> * Node pgsr02:
> >>>
> >>>
> >>> Step 5) Reconnect pingd communication on the standby node.
> >>> The pingd score is updated correctly, but pengine's processing remains blocked.
> >>>
> >>>
> >>> ~ # esxcfg-vswitch -M vmnic1 -p "ap-db" vSwitch1
> >>> ~ # esxcfg-vswitch -M vmnic2 -p "ap-db" vSwitch1
> >>>
> >>> ============
> >>> Last updated: Mon May 13 14:19:40 2013
> >>> Stack: Heartbeat
> >>> Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> >>> Version: 1.0.13-30bb726
> >>> 2 Nodes configured, unknown expected votes
> >>> 2 Resources configured.
> >>> ============
> >>>
> >>> Online: [ pgsr01 pgsr02 ]
> >>>
> >>> Resource Group: test-group
> >>> Dummy1 (ocf::pacemaker:Dummy): Started pgsr01
> >>> Dummy2 (ocf::pacemaker:Dummy): Started pgsr01
> >>> Clone Set: clnPingd
> >>> Started: [ pgsr01 pgsr02 ]
> >>>
> >>> Node Attributes:
> >>> * Node pgsr01:
> >>> + default_ping_set : 100
> >>> * Node pgsr02:
> >>> + default_ping_set : 100
> >>>
> >>> Migration summary:
> >>> * Node pgsr01:
> >>> * Node pgsr02:
> >>>
> >>>
> >>> --------- The blocked state of pengine continues -----
> >>>
> >>> Step 6) Cut off pingd communication on the active node.
> >>> The pingd score is updated correctly, but pengine's processing remains blocked.
> >>>
> >>>
> >>> ~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
> >>> ~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1
> >>>
> >>>
> >>> ============
> >>> Last updated: Mon May 13 14:20:32 2013
> >>> Stack: Heartbeat
> >>> Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> >>> Version: 1.0.13-30bb726
> >>> 2 Nodes configured, unknown expected votes
> >>> 2 Resources configured.
> >>> ============
> >>>
> >>> Online: [ pgsr01 pgsr02 ]
> >>>
> >>> Resource Group: test-group
> >>> Dummy1 (ocf::pacemaker:Dummy): Started pgsr01
> >>> Dummy2 (ocf::pacemaker:Dummy): Started pgsr01
> >>> Clone Set: clnPingd
> >>> Started: [ pgsr01 pgsr02 ]
> >>>
> >>> Node Attributes:
> >>> * Node pgsr01:
> >>> + default_ping_set : 0 : Connectivity is lost
> >>> * Node pgsr02:
> >>> + default_ping_set : 100
> >>>
> >>> Migration summary:
> >>> * Node pgsr01:
> >>> * Node pgsr02:
> >>>
> >>> --------- The blocked state of pengine continues -----
> >>>
> >>>
> >>> After that, the resources do not fail over to the standby node, because no transition can be computed while pengine remains blocked.
> >>> In the vSphere environment, the block is released only after a considerable time, and a transition is finally generated.
> >>> * The I/O blocking of pengine seems to occur repeatedly.
> >>> * Other processes may be blocked, too.
> >>> * It took more than one hour from the failure to failover completion.
> >>>
> >>> This problem shows that resource failover may not occur after a disk failure in a vSphere environment.
> >>>
> >>> Because our users want to run Pacemaker in a vSphere environment, a solution to this problem is necessary.
> >>>
> >>> Do you know of any case where a similar problem was solved on vSphere?
> >>>
> >>> If there is no known solution, we think it is necessary to avoid blocking pengine.
> >>>
> >>> For example...
> >>> 1. crmd monitors its request to pengine with a timer...
> >>> 2. pengine performs its writes under a timer and monitors their progress....
> >>> ..etc...
> >>>
> >>> * This problem does not seem to occur on KVM.
> >>> * The difference may come from the hypervisor.
> >>> * In addition, the problem did not occur on a physical Linux machine.
> >>>
> >>>
> >>> Best Regards,
> >>> Hideo Yamauchi.
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>
> >>
> >
>
>