[Pacemaker] Reboot of cluster members with heavy load on filesystem.

Wed Feb 6 18:58:10 UTC 2013

Hi.

We are running two clusters compounded of two machines. We are using DRBD + OCFS2 to make the common filesystem.
The the version of the components in use are:
OpenSuse 12.1
pacemaker-1.1.6-2.1.2.x86_64
corosync-1.4.1-2.1.3.x86_64
openais-1.1.4-9.7.3.x86_64
ocfs2-tools-1.8.0-4.1.2.x86_64
kernel downgraded to 2.6.37
4x Dell R610

We use one of the cluster to run our MySQL database and the other one to be our web server. On the web cluster we got two resources
mounted on /export and /backup.

The clusters run nice with normal load except when doing backup of files or optimize of the databases. At this time we got a huge
increment of data coming by the mysqldump to the backup resource or from the resource mounted on /export.
Sometimes when performing the backup or optimizing the database (done just on the mysql cluster), the Pacemaker declares a node dead
(but its not) and start the recovering process. When it happens we end up with two machines getting restarted and most of the times
with a database crash :-(

As you can see below, just about 30 seconds after the dump starts on diana the problem happens.
----------------------------------------------------------------
Feb  6 04:22:40 diana stonith-ng: [2917]: info: log_operation: resIPMI-1: Performing: stonith -t external/ipmi -S 
Feb  6 04:22:40 diana stonith-ng: [2917]: info: log_operation: resIPMI-1: success:  0
Feb  6 04:22:40 diana stonith-ng: [2917]: info: stonith_device_execute: Nothing to do for resIPMI-1
Feb  6 04:23:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:24:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:25:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:26:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:27:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:28:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:29:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:30:01 diana /USR/SBIN/CRON[1257]: (root) CMD (/root/scripts/bkp_database_diario.sh)
Feb  6 04:30:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:31:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
Feb  6 04:31:42 diana lrmd: [2919]: WARN: ip_intranet:0:monitor process (PID 1902) timed out (try 1).  Killing with signal SIGTERM
(15).
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] CLM CONFIGURATION CHANGE
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] New Configuration:
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9) 
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Left:
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8) 
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Joined:

----------------------------------------------------------------

Feb  6 04:30:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
Feb  6 04:31:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
Feb  6 04:31:41 apolo corosync[2848]:  [TOTEM ] A processor failed, forming new configuration.
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8) 
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9) 
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 304: memb=1, new=0,
lost=1
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: memb: apolo 16845322
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: lost: diana 33622538
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8) 
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 304: memb=1, new=0, lost=0
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: MEMB: apolo 16845322
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: ais_mark_unseen_peer_dead: Node diana was not seen in the previous transition
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: update_member: Node 33622538/diana is now: lost
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: send_member_notification: Sending membership update 304 to 4 children
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 304: memb=1, new=0,
lost=1
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: memb: apolo 16845322
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: lost: diana 33622538
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8) 
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 304: memb=1, new=0, lost=0
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: MEMB: apolo 16845322
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: ais_mark_unseen_peer_dead: Node diana was not seen in the previous transition
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: update_member: Node 33622538/diana is now: lost
Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: send_member_notification: Sending membership update 304 to 4 children
Feb  6 04:31:47 apolo cib: [2854]: notice: ais_dispatch_message: Membership 304: quorum lost
----------------------------------------------------------------

After lots of log apolo asks diana to reboot and sometime after that it got rebooted too.
We had an old cluster with heartbeat and DRBD used to cause it on that system but now looks like Pacemaker is the guilt.

Here is my Pacemaker and DRBD configuration
http://www2.connection.com.br/cbastos/pacemaker/crm_config
http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/global_common.setup
http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/backup.res
http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/export.res

And more detailed logs
http://www2.connection.com.br/cbastos/pacemaker/reboot_apolo
http://www2.connection.com.br/cbastos/pacemaker/reboot_diana

Please is there a way to fix this? Maybe some timings adjustments can help?
I have no idea where to start.

Best Regards,
Carlos.