[ClusterLabs] How can I prevent multiple starts of IPaddr2 in an environment using fence_mpath?
飯田 雄介
iidayuus at intellilink.co.jp
Fri Apr 6 00:30:46 EDT 2018
Hi all,
I am testing an environment that uses fence_mpath, with the settings shown below (a sketch of how the stonith resources might have been created follows the status output).
=======
Stack: corosync
Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with quorum
Last updated: Fri Apr 6 13:16:20 2018
Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on x3650e
2 nodes configured
13 resources configured
Online: [ x3650e x3650f ]
Full list of resources:
fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
fenceMpath-x3650f (stonith:fence_mpath): Started x3650f
Resource Group: grpPostgreSQLDB
    prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
    prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
    prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
    prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650e
Resource Group: grpPostgreSQLIP
    prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650e
Clone Set: clnDiskd1 [prmDiskd1]
    Started: [ x3650e x3650f ]
Clone Set: clnDiskd2 [prmDiskd2]
    Started: [ x3650e x3650f ]
Clone Set: clnPing [prmPing]
    Started: [ x3650e x3650f ]
=======
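For reference, a minimal sketch of how fence_mpath stonith resources like these are typically created with pcs; the device path and key values here are assumptions, not our actual configuration:

  # Sketch only: /dev/mapper/mpatha and the key values are placeholders.
  # Each node gets its own stonith resource with its own reservation key.
  pcs stonith create fenceMpath-x3650e fence_mpath \
      key=1 devices=/dev/mapper/mpatha \
      pcmk_host_list=x3650e meta provides=unfencing
  pcs stonith create fenceMpath-x3650f fence_mpath \
      key=2 devices=/dev/mapper/mpatha \
      pcmk_host_list=x3650f meta provides=unfencing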
When a split-brain occurs in this environment, x3650f executes fencing, and the resources are started on x3650f.
=== view from x3650e ===
Stack: corosync
Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
Last updated: Fri Apr 6 13:16:36 2018
Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on x3650e
2 nodes configured
13 resources configured
Node x3650f: UNCLEAN (offline)
Online: [ x3650e ]
Full list of resources:
fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
fenceMpath-x3650f (stonith:fence_mpath): Started [ x3650e x3650f ]
Resource Group: grpPostgreSQLDB
    prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
    prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
    prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
    prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650e
Resource Group: grpPostgreSQLIP
    prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650e
Clone Set: clnDiskd1 [prmDiskd1]
    prmDiskd1 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
    Started: [ x3650e ]
Clone Set: clnDiskd2 [prmDiskd2]
    prmDiskd2 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
    Started: [ x3650e ]
Clone Set: clnPing [prmPing]
    prmPing (ocf::pacemaker:ping): Started x3650f (UNCLEAN)
    Started: [ x3650e ]
=== view from x3650f ===
Stack: corosync
Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
Last updated: Fri Apr 6 13:16:36 2018
Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on x3650e
2 nodes configured
13 resources configured
Online: [ x3650f ]
OFFLINE: [ x3650e ]
Full list of resources:
fenceMpath-x3650e (stonith:fence_mpath): Started x3650f
fenceMpath-x3650f (stonith:fence_mpath): Started x3650f
Resource Group: grpPostgreSQLDB
    prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650f
    prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650f
    prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650f
    prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650f
Resource Group: grpPostgreSQLIP
    prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650f
Clone Set: clnDiskd1 [prmDiskd1]
    Started: [ x3650f ]
    Stopped: [ x3650e ]
Clone Set: clnDiskd2 [prmDiskd2]
    Started: [ x3650f ]
    Stopped: [ x3650e ]
Clone Set: clnPing [prmPing]
    Started: [ x3650f ]
    Stopped: [ x3650e ]
=======
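Our understanding of the mechanism: fence_mpath fences a node by removing that node's SCSI-3 persistent reservation key from the shared multipath device, so the fenced node loses access to the disk but otherwise keeps running, and resources that do not touch the disk (such as IPaddr2) are not stopped by the fencing itself. The registered keys can be checked with mpathpersist; a quick check, assuming the shared device is /dev/mapper/mpatha (a placeholder path):

  # List the reservation keys registered on the shared device.
  # /dev/mapper/mpatha is a placeholder, not our actual device.
  mpathpersist --in --read-keys /dev/mapper/mpatha

After x3650f fences x3650e, the key of x3650e should no longer appear in this list.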
However, the IPaddr2 resource on x3650e does not stop until a pgsql monitor error occurs.
During that window, IPaddr2 is running on both nodes at the same time.
=== view from x3650e after the pgsql monitor error ===
Stack: corosync
Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
Last updated: Fri Apr 6 13:16:56 2018
Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on x3650e
2 nodes configured
13 resources configured
Node x3650f: UNCLEAN (offline)
Online: [ x3650e ]
Full list of resources:
fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
fenceMpath-x3650f (stonith:fence_mpath): Started [ x3650e x3650f ]
Resource Group: grpPostgreSQLDB
    prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
    prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
    prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
    prmApPostgreSQLDB (ocf::heartbeat:pgsql): Stopped
Resource Group: grpPostgreSQLIP
    prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Stopped
Clone Set: clnDiskd1 [prmDiskd1]
    prmDiskd1 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
    Started: [ x3650e ]
Clone Set: clnDiskd2 [prmDiskd2]
    prmDiskd2 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
    Started: [ x3650e ]
Clone Set: clnPing [prmPing]
    prmPing (ocf::pacemaker:ping): Started x3650f (UNCLEAN)
    Started: [ x3650e ]
Node Attributes:
* Node x3650e:
    + default_ping_set : 100
    + diskcheck_status : normal
    + diskcheck_status_internal : normal

Migration Summary:
* Node x3650e:
    prmApPostgreSQLDB: migration-threshold=1 fail-count=1 last-failure='Fri Apr 6 13:16:39 2018'

Failed Actions:
* prmApPostgreSQLDB_monitor_10000 on x3650e 'not running' (7): call=60, status=complete,
    exitreason='Configuration file /dbfp/pgdata/data/postgresql.conf doesn't exist',
    last-rc-change='Fri Apr 6 13:16:39 2018', queued=0ms, exec=0ms
=======
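Incidentally, the dual-active window can be observed directly: while it is open, the virtual IP is configured on both nodes. A quick check on each node (192.168.0.100 stands in for our actual VIP):

  # Run on both nodes during the window; 192.168.0.100 is a placeholder VIP.
  ip -o addr show | grep 192.168.0.100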
We regard this behavior as a problem.
Is there a way to prevent IPaddr2 from being active on both nodes during this window?
Regards,
Yusuke