Linux HA cluster with Xen, Heartbeat, and Pacemaker: domU does not fail over to the secondary node

Posted by Kendall on Server Fault, 2011-02-16.

I am having the following problem with an OpenSuSE + Heartbeat + Pacemaker + Xen HA cluster: when the node a Xen domU is running on goes "dead", the domU is not restarted on the second node.

The cluster is setup with two nodes, each running OpenSuSE-11.3, Heartbeat 3.0, and Pacemaker 1.0 in CRM mode. For storage I am using a LUN on an iSCSI SAN device; the LUN is formatted with OCFS2 and managed with LVM. The Xen domU has two logical volumes; one for root and the other for swap. I am using IPMI cards for STONITH devices, and a dedicated ethernet link for heartbeat communications.
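
For reference, a quick way to confirm that both nodes see the shared storage (these are generic LVM/OCFS2 checks; "vg_xen" and the LV names are from my setup):

vgs vg_xen                # is the volume group on the iSCSI LUN visible?
lvs vg_xen                # are xen-util-root and xen-util-swap listed?
vgchange -ay vg_xen       # activate the logical volumes if they are inactive
mounted.ocfs2 -f          # which nodes have the OCFS2 volume mounted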

The ha.cf file is as follows:
keepalive 1
deadtime 10
warntime 5
udpport 694
ucast eth1
auto_failback off
node dhcp-166
node stage
use_logd yes
crm yes
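
(One note on the above: the ucast directive normally takes the peer node's IP address as a second argument; with a placeholder address, not my real one, it would read:

ucast eth1 <peer-ip-address>
)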

My resources look as follows:
crm(live)configure# show
node $id="5c1aa924-bba4-4f95-a367-6c9a58ac4a38" dhcp-166
node $id="cebc92eb-af24-4833-aaf0-672adf80b58e" stage
primitive Xen-Util ocf:heartbeat:Xen \
meta target-role="Started" \
operations $id="Xen-Util-operations" \
op start interval="0" timeout="60" start-delay="0" \
op stop interval="0" timeout="120" \
params xmfile="/etc/xen/vm/xen-util"
primitive my-stonith stonith:external/ipmi \
params hostname="dhcp-166" ipaddr="192.168.3.106" userid="ADMIN" passwd="xxx" \
op monitor interval="2m" timeout="60s"
primitive my-stonith2 stonith:external/ipmi \
params hostname="stage" ipaddr="192.168.3.105" userid="ADMIN" passwd="xxx" \
op monitor interval="2m" timeout="60s"
property $id="cib-bootstrap-options" \
dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
cluster-infrastructure="Heartbeat"
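
For comparison, similar two-node setups often also pin each STONITH resource away from the node it fences, so a node never has to fence itself; a sketch using my resource names (the constraint IDs are hypothetical, not part of my config):

location l-stonith-dhcp166 my-stonith -inf: dhcp-166
location l-stonith-stage my-stonith2 -inf: stage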

The Xen domU config file is as follows:

name = "xen-util"
bootloader = "/usr/lib/xen/boot/domUloader.py"
#bootargs = "xvda1:/vmlinuz-xen,/initrd-xen"
bootargs = "--entry=xvda1:/boot/vmlinuz-xen,/boot/initrd-xen"
memory = 4096
disk = [ 'phy:vg_xen/xen-util-root,xvda1,w',
'phy:vg_xen/xen-util-swap,xvda2,w', ]
root = "/dev/xvda1"
vif = [ 'mac=00:16:3e:42:42:06' ]
#vfb = [ 'type=vnc,vncunused=0,vnclisten=192.168.3.172' ]
extra = ""

Say domU "Xen-Util" is running on node "stage"; if "stage" goes down, "Xen-Util" does not restart on node "dhcp-166". It seems to want to try as an "xm list" will show it for a few seconds and if you "xm console xen-util" it will give a message like "copying /boot/kernel.gz from xvda1 to /var/lib/xen/tmp/kernel.a53gs for booting". However, it never gets past that, eventually gives up, and no longer appears in "xm list". Now, when node "stage" comes back online after being power cycled, it detects that "Xen-Util" isn't running, and starts it (on stage).

I've tried starting "Xen-Util" on node "dhcp-166" manually, without the cluster running, and it works fine; no problems. So I know the domU itself can run on that node.
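
For reference, that manual test was along these lines, using the same config file the cluster points at:

xm create /etc/xen/vm/xen-util   # start the domU from its config file
xm list                          # the domU shows up and stays running
xm console xen-util              # watch it boot normally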

Any ideas? Thanks!
