Heartbeat/DRBD failover didn't work as expected. How do I make the failover more robust?

Posted by Quinn Murphy on Server Fault See other posts from Server Fault or by Quinn Murphy
Published on 2012-06-19T20:09:05Z Indexed on 2012/06/19 21:18 UTC
Read the original article Hit count: 252

Filed under:
|
|
|

I had a scenario where a DRBD-heartbeat set up had a failed node but did not failover. What happened was the primary node had locked up, but didn't go down directly (it was inaccessible via ssh or with the nfs mount, but it could be pinged). The desired behavior would have been to detect this and failover to the secondary node, but it appears that since the primary didn't go full down (there is a dedicated network connection from server to server), heartbeat's detection mechanism didn't pick up on that and therefore didn't failover.

Has anyone seen this? Is there something that I need to configure to have more robust cluster failover? DRBD seems to otherwise work fine (had to resync when I rebooted the old primary), but without good failover, it's use is limited.

  • heartbeat 3.0.4
  • drbd84
  • RHEL 6.1
  • We are not using Pacemaker

nfs03 is the primary server in this setup, and nfs01 is the secondary.

ha.cf

  # Hearbeat Logging
logfacility daemon
udpport 694


ucast eth0 192.168.10.47
ucast eth0 192.168.10.42

# Cluster members
node nfs01.openair.com
node nfs03.openair.com

# Hearbeat communication timing.
# Sets the triggers and pulse time for swapping over.
keepalive 1
warntime 10
deadtime 30
initdead 120


#fail back automatically
auto_failback on

and here is the haresources file:

nfs03.openair.com   IPaddr::192.168.10.50/255.255.255.0/eth0      drbddisk::data  Filesystem::/dev/drbd0::/data::ext4 nfs nfslock

© Server Fault or respective owner

Related posts about linux

Related posts about redhat