SBD killing both cluster nodes when there are even small SAN network problems

Posted by Wieslaw Herr on Server Fault, 2012-06-05

I am having problems with STONITH via SBD in an openais-based cluster.

Some background: The active/passive cluster has two nodes, node1 and node2. They are configured to provide an NFS service to users. To avoid split-brain problems, both are configured to use SBD. SBD uses two 1MB disks that are available to the hosts via a multipath fibre-channel network.
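For context, the SBD devices in a setup like this are typically initialized and hooked into the cluster roughly as follows (the /dev/mapper names here are placeholders, not my actual device paths):

    # initialize the SBD header on both 1MB multipath devices
    sbd -d /dev/mapper/sbd_disk1 -d /dev/mapper/sbd_disk2 create

    # verify that both nodes see the headers and the message slots
    sbd -d /dev/mapper/sbd_disk1 -d /dev/mapper/sbd_disk2 dump
    sbd -d /dev/mapper/sbd_disk1 -d /dev/mapper/sbd_disk2 list

    # /etc/sysconfig/sbd on both nodes (devices separated by ";")
    SBD_DEVICE="/dev/mapper/sbd_disk1;/dev/mapper/sbd_disk2"
    SBD_OPTS="-W"

    # stonith resource in the CIB
    crm configure primitive stonith-sbd stonith:external/sbd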

The problems start when something happens to the SAN network. For example, today one of the Brocade switches got rebooted and both nodes lost 2 out of 4 paths to each disk, which resulted in both nodes committing suicide and rebooting. This, of course, was highly undesirable because a) there were still paths left, and b) even if the switch had been down for only 10-20 seconds, a reboot cycle of both nodes takes 5-10 minutes and all NFS locks are lost.

I tried increasing the SBD timeout values (to 10s+ values; a dump is attached at the end), however a "WARN: Latency: No liveness for 4 s exceeds threshold of 3 s" hints that something isn't working as I would expect it to.
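For reference, the way I understand the on-disk timeouts are changed is roughly this (device name is a placeholder; the actual values are in the dump linked at the end):

    # show the current header, including the watchdog and msgwait timeouts
    sbd -d /dev/mapper/sbd_disk1 dump

    # rewrite the header with a 10s watchdog timeout and a 20s msgwait timeout
    # (this recreates the slots, so stop the cluster stack on both nodes first)
    sbd -d /dev/mapper/sbd_disk1 -1 10 -4 20 create

    # Pacemaker's stonith-timeout should be larger than msgwait
    crm configure property stonith-timeout=30s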

Here is what I would like to know:

a) Is SBD working as it should, killing the nodes even though 2 paths were still available?

b) If not, is the attached multipath.conf correct? The storage controller we use is an IBM SVC (IBM 2145); should there be any controller-specific configuration for it, as in multipath.conf.defaults? (An illustrative devices section is sketched below.)

c) How should I go about increasing the timeouts in SBD?
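To make question b) concrete: what I would expect for an IBM SVC is a devices section along these lines (written from memory, and the exact directive names differ between multipath-tools versions, so treat it as a sketch rather than the recommended settings):

    devices {
        device {
            vendor                "IBM"
            product               "2145"
            path_grouping_policy  group_by_prio
            prio                  alua
            path_checker          tur
            failback              immediate
            no_path_retry         5
        }
    }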

attachments: multipath.conf and sbd dump (http://hpaste.org/69537)
