[Pacemaker] pacemaker error after a couple week or month (David Vossel)

Mon Dec 22 14:07:01 UTC 2014

----- Original Message -----
> Hello David,
> 
> I believe I am using the latest version packaged for Ubuntu, which is 1.1.10.
> Do you think it has a bug?

There have been a number of fixes to the lrmd since v1.1.10, and it is possible
that a couple of the bugs they address could cause crashes. Again, without a
backtrace from the lrmd core dump, it is difficult for me to say whether your
specific issue has been fixed. Building from source could yield better results
for you. The pacemaker master branch is stable at the moment.
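
For reference, a backtrace can usually be pulled out of a core file with gdb
in batch mode. The following is only a minimal sketch under assumptions that
are not from this thread: the helper names gdb_backtrace_command and
run_backtrace are made up for illustration, the core file path has to be
supplied by the caller, and gdb plus the pacemaker debug symbols are assumed
to be installed.

```python
import subprocess


def gdb_backtrace_command(executable, core_file):
    """Build a gdb command line that prints a full backtrace of every thread."""
    return [
        "gdb",
        "--batch",                          # run the -ex commands, then exit
        "-ex", "thread apply all bt full",  # backtrace with locals, all threads
        executable,
        core_file,
    ]


def run_backtrace(executable, core_file):
    """Run gdb on the core file and return whatever it printed to stdout.

    Hypothetical helper: it needs gdb on PATH and a readable core file.
    """
    cmd = gdb_backtrace_command(executable, core_file)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout
```

Called as run_backtrace("/usr/lib/pacemaker/lrmd", "/path/to/core"), the text
it returns is what would be worth posting to the list. On Ubuntu, apport may
keep the core inside its crash report rather than leaving a plain core file,
so it may need to be extracted first.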

lrmd-related changes since 1.1.10:

# git log --oneline Pacemaker-1.1.10^..HEAD | grep -e "lrmd:"
71b429c Low: lrmd: fix regression test LSBdummy install
fb94901 Test: lrmd: Ensure the lsb script is executable
30d978e Low: lrmd: systemd stress tests
568e41d Fix: lrmd: Prevent glib assert triggered by timers being removed from mainloop more than once
977de97 High: lrmd: cancel pending async connection during disconnect
d2d0cba Low: lrmd: ensures systemd python package is available when systemd tests run
f0fe737 Fix: lrmd: fix rescheduling of systemd monitor op during start
c0e8e6a Low: lrmd: prevent \n from being printed in exit reason output
2342835 High: lrmd: pass exit reason prefix to ocf scripts as env variable
412631c High: lrmd: store failed operation exit reason in cib
ad083a8 Fix: lrmd: Log with the correct personality
718bf5b Test: lrmd: Update the systemd agent to test long running actions
c78b4b8 Fix: lrmd: Handle systemd reporting 'done' before a resource is actually stopped
3bd6c30 Fix: lrmd: Handle systemd reporting 'done' before a resource is actually stopped
574fc49 Fix: lrmd: Prevent OCF agents from logging to random files due to "value" of setenv() being NULL
155c6aa Low: lrmd: wider use of defined literals
fa8bd56 Fix: lrmd: Expose logging variables expected by OCF agents
d9cc751 Fix: lrmd: Provide stderr output from agents if available, otherwise fall back to stdout
3adc781 Low: lrmd: clean up the agent's entire process group
348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed
fa2954e Low: lrmd: Warning msg to indicate duplicate op merge has occurred
b94d0e9 Low: lrmd: recurring op merger regression tests
c29ab27 High: lrmd: Merge duplicate recurring monitor operations
c1a326d Test: lrmd: Bump the lrmd test timeouts to avoid transient travis failures
deead39 Low: lrmd: Install ping agent during lrmd regression test.
aad79e2 Low: lrmd: Make ocf dummy agents executable with regression test in src tree
5c8c7a5 Test: lrmd: Kill uninstalled daemons by the correct name
8e90200 Test: lrmd: Fix upstart metadata test and install required OCF agents
bbdd6e1 Test: lrmd: Allow regression tests to run from the source tree
87f4091 Low: lrmd: Send event alerting estabilished clients that a new client connection is created.
644752e Fix: lrmd: Correctly calculate metadata for the 'service' class
ea7991f Fix: lrmd: Do not interrogate NULL replies from the server
1c14b9d Fix: lrmd: Correctly cancel monitor actions for lsb/systemd/service resources on cleaning up
eceeeea Doc: lrmd: Indicate which function recieves the proxied command
ad4056f Test: lrmd: Drop the default verbosity for lrmd regression tests
eb40d6a Fix: lrmd: Do not overwrite any existing operation status error
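
To skim a list like the one above for the entries most likely to matter for a
crash, the lines can be grouped by their leading tag. This is a rough sketch,
not part of pacemaker: the regular expression, the group_by_kind name, and the
choice to look at the "High" and "Fix" tags first are all assumptions made for
illustration, and the sample is just four lines copied from the log.

```python
import re

# Matches lines such as "977de97 High: lrmd: cancel pending async connection ..."
LOG_LINE = re.compile(
    r"^(?P<sha>[0-9a-f]{7,40}) (?P<kind>\w+): lrmd: (?P<summary>.+)$"
)


def group_by_kind(log_text):
    """Return {tag: [(sha, summary), ...]} for every lrmd commit line."""
    groups = {}
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            groups.setdefault(match["kind"], []).append(
                (match["sha"], match["summary"])
            )
    return groups


sample = """\
977de97 High: lrmd: cancel pending async connection during disconnect
568e41d Fix: lrmd: Prevent glib assert triggered by timers being removed from mainloop more than once
71b429c Low: lrmd: fix regression test LSBdummy install
ea7991f Fix: lrmd: Do not interrogate NULL replies from the server
"""

groups = group_by_kind(sample)
for kind in ("High", "Fix"):
    for sha, summary in groups.get(kind, []):
        print(f"{kind:<5}{sha}  {summary}")
```

Feeding it the full output of the git log command instead of the sample gives
the same grouping for the whole list.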

-- Vossel

> Should I compile from source?
> 
> Best Regards,
> 
> 
> Ariee
> 
> 
> On Fri, Dec 19, 2014 at 8:27 PM, < pacemaker-request at oss.clusterlabs.org >
> wrote:
> 
> 
> Message: 2
> Date: Fri, 19 Dec 2014 14:21:59 -0500 (EST)
> From: David Vossel < dvossel at redhat.com >
> To: The Pacemaker cluster resource manager
> < pacemaker at oss.clusterlabs.org >
> Subject: Re: [Pacemaker] pacemaker error after a couple week or month
> Message-ID:
> < 102420175.739708.1419016919246.JavaMail.zimbra at redhat.com >
> Content-Type: text/plain; charset=utf-8
> 
> 
> 
> ----- Original Message -----
> > Hello,
> > 
> > I have two active-passive failover systems built on Corosync and DRBD.
> > One system uses two Debian servers and the other uses two Ubuntu servers.
> > The Debian servers provide web server failover and the Ubuntu servers
> > provide database server failover.
> > 
> > I applied the same Pacemaker configuration to both. Everything works fine:
> > failover works and so does file system synchronization. On the Ubuntu
> > servers, however, an error always appears after a couple of weeks or
> > months. Pacemaker on ubuntu1 reported a different status from ubuntu2:
> > ubuntu1 assumed that ubuntu2 was down, while ubuntu2 assumed that
> > something had happened to ubuntu1 but that it was still alive, and took
> > over the resources. As a result the DRBD resource could not be taken over,
> > so no failover happened and we had to restart the server manually, because
> > restarting Pacemaker and Corosync did not help. I have changed the
> > Pacemaker configuration a couple of times, but the problem still exists.
> > 
> > Has anyone experienced this? I am using Ubuntu 14.04.1 LTS.
> > 
> > I found this error in apport.log:
> > 
> > ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable:
> > /usr/lib/pacemaker/lrmd (command line "/usr/lib/pacemaker/lrmd")
> 
> Wow, it looks like the lrmd is crashing on you. I haven't seen this occur
> in the wild before. Without a backtrace it will be nearly impossible to
> determine what is happening.
> 
> Do you have the ability to upgrade pacemaker to a newer version?
> 
> -- Vossel
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>