[Pacemaker] Pacemaker failover delays (followup)

Fri Mar 8 23:50:10 CET 2013

Andrew,

Thanks for the feedback on my earlier questions from March 6th. I've done some further investigation into the timing of what I'd call the "simple" failover case, where an SSID that is master on the DC node is killed and it takes 10-12 seconds before the slave SSID on the other node transitions to master. (Recall that an "SSID" is a SliceServer app instance, each of which is abstracted as a Pacemaker resource.)

Before going into my findings, I want to clear up a couple of misstatements on my part.

* Regarding my mention of "notifications" in my earlier e-mail, I misused the term. I was simply referring to the "notify" events passed from the DC to the other node.

* I also misspoke when I said that the failed SSID was subsequently restarted as a result of a monitor event. In fact, the SSID process is restarted by the "ss" resource agent script in response to a "start" event from lrmd.

The key issue, however, is the time required - 10 to 12 seconds - from the moment the master SSID is killed until the slave fails over to become master. You opined that the time required would largely depend upon the behavior of the resource agent, which in our case is a script called "ss". To determine how much the ss script's execution contributes to that delay, I modified it to log the current monotonic system clock value each time it starts and again just before it exits. The log messages specify the clock value in ms.
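For anyone wanting to reproduce this kind of instrumentation, here is a minimal sketch of the idea in Python. It is not the actual ss code: the helper names (monotonic_ms, format_log_line, elapsed_ms) and the log line layout are hypothetical, chosen only for illustration, and the real agent may well be written in another language.

```python
import time


def monotonic_ms() -> int:
    """Return the monotonic clock reading in whole milliseconds."""
    # A monotonic clock never jumps backwards, so differences between
    # two readings on the same host are safe to interpret as durations.
    return time.monotonic_ns() // 1_000_000


def format_log_line(action: str, phase: str, now_ms: int) -> str:
    """Build one log line (hypothetical format, not the real ss output)."""
    return f"ss action={action} phase={phase} mono_ms={now_ms}"


def elapsed_ms(start_ms: int, end_ms: int) -> int:
    """Duration of one agent invocation, in milliseconds."""
    return end_ms - start_ms


if __name__ == "__main__":
    start = monotonic_ms()
    print(format_log_line("monitor", "enter", start))
    time.sleep(0.05)  # stand-in for the agent's real work
    end = monotonic_ms()
    print(format_log_line("monitor", "exit", end))
    print(f"elapsed_ms={elapsed_ms(start, end)}")
```

Subtracting the "enter" value from the "exit" value of the same invocation gives the time spent inside the agent. Note that monotonic readings are generally only comparable on the same host, so they should not be compared across the two cluster nodes.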